Submitted to the Annals of Statistics



ROBUST EMPIRICAL MEAN ESTIMATORS 



By Matthieu Lerasle and Roberto I. Oliveira 



CNRS-Université Nice and IMPA



Abstract: We study robust estimators of the mean of a probability 
measure P, called robust empirical mean estimators. This elementary 
construction is then used to revisit a problem of aggregation and 
a problem of estimator selection, extending these methods to not 
necessarily bounded collections of previous estimators. 

We then consider the problem of robust M-estimation. We propose
a slightly more complicated construction to handle this problem and,
as examples of applications, we apply our general approach to
least-squares density estimation, to density estimation with Kullback loss
and to a non-Gaussian, unbounded, random design and heteroscedastic
regression problem.

Finally, we show that our strategy can be used when the data are 
only assumed to be mixing.

1. Introduction. The goal of this paper is to develop the theory and
applications of what we call robust empirical mean estimators. As a first
step, we show that one can estimate the mean $m := \mathbb{E}_P(X)$ of a random
variable $X$ with distribution $P$ based on the observation of a finite sample
$X_{1:n} := (X_1, \ldots, X_n)$, i.i.d. with marginal $P$. More precisely, our estimator
$\widehat{m}(\delta)$ takes as input the sample $X_{1:n}$ and a confidence level $\delta \in (0, 1)$, and
satisfies:

(1) $\mathbb{P}\left\{ |\widehat{m}(\delta) - m| > C \sigma \sqrt{\ln(\delta^{-1})/n} \right\} \le \delta$,

where $\sigma^2$ is the variance of $X$ and $C$ is a universal constant. The classical
empirical mean estimator $\widehat{m} := n^{-1} \sum_{i=1}^{n} X_i$ satisfies such bounds only when
the observations are Gaussian. Otherwise, some extra assumptions on the
data are generally required and the concentration of $\widehat{m}$ takes a different
shape. For example, if $X_i \le b$, Bennett's concentration inequality gives

and heavier tails imply wider confidence intervals for the empirical mean.

The goal of achieving (1) is not new. Catoni [12, 14] has recently developed
a Pac-Bayesian approach that has been applied to several problems of
nonparametric statistics, including linear regression and classification [1, 2, 13].
The main idea behind the constructions of [14] is to do "soft-truncation" of
the data, so as to mitigate heavy tails. This gives estimators achieving (1)
with nearly optimal constant $C \approx \sqrt{2}$ for a wide range of $\delta$.

One significant drawback of Catoni's construction is that one needs to
know $\sigma$ or an upper bound for the kurtosis $\kappa$ in order to achieve the bound
in (1). In this paper we show that no such information is necessary. The
construction we present, while not well-known, is not new: we learned about
it from [20] and the main ideas seem to date back to the work of Nemirovsky
and Yudin [29]. The main idea of these authors was to split the sample into
regular blocks, then take the empirical mean on each block, and finally
define the estimator as the median of these preliminary estimators. When
the number $V$ of blocks is about $\ln(\delta^{-1})$, the resulting robust empirical mean
estimator satisfies (1).

This result is sufficient to revisit some recent procedures of aggregation 
and selection of estimators and extend their range of application to
heavier-tailed distributions. Before we move on to discuss these applications, we note
that in all cases the desired confidence level $\delta$ will be built into the estimator
(i.e. it must be chosen in advance), and it cannot be smaller than $e^{-n/2}$.
While similar limitations are faced by Catoni in [14], the linear regression 
estimator in [2] manages to avoid this difficulty. By contrast, our estimators 
are essentially as efficient as their non-robust counterparts. This favourable 
trait is not shared by the estimator in [2], which is based on a non-convex 
optimization problem. 

We now discuss our first application of robust estimators, to the
least-squares density estimation problem. We study the Lasso estimator of [31],
which is a famous aggregation procedure extensively studied over the last
few years (see for example [7, 11, 15, 17, 19, 24, 28, 34-36] and the
references therein), using $\ell_1$-penalties. We modify the Lasso estimator, adapted
to least-squares density estimation in [11], and extend the results of [11] to
unbounded dictionaries.

We then study the estimator selection procedure of [3] in least-squares
density estimation. Estimator selection is a new and important theory that
covers, in the same framework, the problems of model selection and of the
selection of a statistical strategy together with the associated tuning
constants. [3] provides a general approach including density estimation, but
his procedure, based on pairwise tests between estimators in the spirit of [8],
was not computationally tractable. A tractable algorithm is available in
Gaussian linear regression in [5]. Our contribution to the field is twofold.
First, we extend the tractable approach of [5] to least-squares density
estimation. Our results rely on empirical process theory rather than Gaussian
techniques and may be adapted to other frameworks, provided that some uniform
upper bounds on the previous estimators are available. Then, we extend these
results using robust empirical means. The robust approach is restricted to the
variable selection problem but allows one to handle not necessarily bounded
collections of estimators.

After these direct applications, we consider the problem of building ro- 
bust estimators in M-estimation. We present a general approach based on 
a slightly more complicated version of the basic robust empirical mean con- 
struction. We introduce a margin type assumption to control the risk of our 
estimator. This assumption is not common in the literature but we present 
classical examples of statistical frameworks where it is satisfied. We then
apply the general strategy to least-squares and maximum likelihood
estimators of the density and to a non-Gaussian, unbounded, random design
and heteroscedastic regression setting. In these applications, our estimators
satisfy optimal risk bounds, up to a logarithmic factor. This is an important
difference with [2] where exact convergence rates were obtained using
a localized version of the Pac-Bayesian estimators.

Finally, the decomposition of the data set into blocks was used in recent 
works [4, 16, 22, 23] in order to extend model selection procedures to mixing 
data. The decomposition is done to couple the data with independent blocks 
and then apply the methods of the independent case. Similar ideas can 
be developed here, but it is interesting to notice that the extension does 
not require another block decomposition. The initial decomposition of the 
robust empirical mean algorithm is sufficient and our procedures can be used 
with mixing data. We extend several theorems of the previous sections to 
illustrate this fact. 

The paper is organized as follows. Section 2 presents basic notations, the 
construction of the new estimators and the elementary concentration in- 
equality that they satisfy. As a first application, we deduce an upper bound 
for the variance and a confidence interval for the mean. Section 3 presents the 
application to Lasso estimators. We extend the results of [11] in least-squares 
density estimation to unbounded dictionaries. Section 4 presents the results 
on estimator selection. We extend the algorithm of [5] to least-squares den- 
sity estimation and obtain a general estimator selection theorem for bounded 
collections of estimators. We also study our robust approach in this context
and extend, for the important subproblem of variable selection, the estimator
selection theorem to unbounded collections of estimators. Section 5 considers
the problem of building robust estimators in M-estimation. We present a
general strategy, slightly more complicated than the one of Section 2, that we
apply to the M-estimation frameworks of least-squares density estimation,
density estimation with Kullback loss and a heteroscedastic and random
design regression problem where the target functions are not bounded and
the errors not Gaussian in general. Section 6 presents the extension to the
mixing case and the proofs are postponed to the appendix.

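2. Robust estimation of the mean. Let $(\mathbb{X}, \mathcal{X})$ be a measurable
space, let $X_{1:n} = (X_1, \ldots, X_n)$ be i.i.d., $\mathbb{X}$-valued random variables with
common distribution $P$ and let $X$ be an independent copy of $X_1$. Let $V \le n$
be an integer and let $\mathcal{B} = (B_1, \ldots, B_V)$ be a regular partition of
$I_n := \{1, \ldots, n\}$, i.e.

(Creg) $\forall K = 1, \ldots, V, \quad \left| \mathrm{Card}\, B_K - \frac{n}{V} \right| \le 1$.

We will moreover always assume, for notational convenience, that $V \le n/2$
so that, from (Creg), $\mathrm{Card}\, B_K \ge n/V - 1 \ge n/(2V)$. For all non empty
subsets $A \subset I_n$, for all real valued measurable functions $t$ (respectively for
all integrable functions $t$), we denote by

$P_A t = \frac{1}{\mathrm{Card}\, A} \sum_{i \in A} t(X_i)$ (respectively $P t = \mathbb{E}(t(X))$).

For short, we also denote by $P_n = P_{I_n}$. For all integers $N$, for all $a_{1:N} =
(a_1, \ldots, a_N)$ in $\mathbb{R}^N$, we denote by $\mathrm{Med}(a_{1:N})$ any real number $b$ such that

$\mathrm{Card}\{ i \le N \text{ s.t. } a_i \le b \} \ge \frac{N}{2}, \quad \mathrm{Card}\{ i \le N \text{ s.t. } a_i \ge b \} \ge \frac{N}{2}$.

Let us finally introduce, for all real valued measurable functions $t$,

$\overline{P}_{\mathcal{B}} t = \mathrm{Med}\{ P_{B_K} t, \ K = 1, \ldots, V \}$.

Our first result is the following concentration inequality.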
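The definitions of this section can be sketched in a few lines of Python. This is only an illustration under our own conventions: the function names are hypothetical, indices run from 0 to n-1 instead of 1 to n, and the blocks are taken consecutive, which is one possible choice of regular partition among many.

```python
import statistics


def regular_partition(n, n_blocks):
    """Split indices 0..n-1 into n_blocks consecutive blocks of near-equal size."""
    base, extra = divmod(n, n_blocks)
    blocks, start = [], 0
    for k in range(n_blocks):
        size = base + (1 if k < extra else 0)  # the first `extra` blocks get one more point
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks


def satisfies_creg(blocks, n):
    """Check (Creg): every block size is within 1 of n / V."""
    v = len(blocks)
    return all(abs(len(block) - n / v) <= 1 for block in blocks)


def is_median(b, values):
    """Check the two counting conditions that define Med."""
    half = len(values) / 2
    at_most = sum(1 for a in values if a <= b)
    at_least = sum(1 for a in values if a >= b)
    return at_most >= half and at_least >= half


def robust_empirical_mean(t, data, blocks):
    """Median over the blocks of the block averages of t(X_i)."""
    block_means = [sum(t(data[i]) for i in block) / len(block) for block in blocks]
    return statistics.median(block_means)
```

For an even number of blocks, `statistics.median` returns the midpoint of the two central values, which is one admissible choice of Med since the definition allows any real number satisfying the two counting conditions.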

Proposition 1. Let $(\mathbb{X}, \mathcal{X})$ be a measurable space and let $X_{1:n}$ be i.i.d.,
$\mathbb{X}$-valued random variables with common distribution $P$. Let $f : \mathbb{X} \to \mathbb{R}$ be a
measurable function such that $\mathrm{var}_P(f) < \infty$ and let $\delta \in (0, 1)$. Let $V \le n/2$
be an integer and let $\mathcal{B}$ be a partition of $I_n$ satisfying (Creg). If $V \ge \ln(\delta^{-1})$,
we have, for some absolute constant $L_1 \le 2\sqrt{6e}$,

$\mathbb{P}\left\{ \overline{P}_{\mathcal{B}} f - P f > L_1 \sqrt{\mathrm{var}_P f} \sqrt{\frac{V}{n}} \right\} \le \delta$.

Proposition 1 is proved in Section A.1.

Remark 1. An important feature of this result is that the concentration
inequality does not involve the sup-norm of the data. This is the key for the
developments presented in Sections 3 and 4. The drawbacks are that the
level $\delta$ has to be chosen before the construction of the estimator and that
the process $\overline{P}_{\mathcal{B}}$ is not linear.
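To get a feeling for the size of the deviation term in Proposition 1, the following sketch evaluates it numerically. It is an illustration under assumptions of ours: the function name is hypothetical, the variance is supposed known, the number of blocks is taken as the smallest integer at least ln(1/delta), and the unknown absolute constant is replaced by its stated upper bound 2*sqrt(6e), which can only make the radius larger.

```python
import math

# Upper bound on the absolute constant L_1 stated in Proposition 1.
L1_UPPER_BOUND = 2.0 * math.sqrt(6.0 * math.e)


def deviation_radius(variance, n, delta):
    """Deviation term L_1 * sqrt(var) * sqrt(V / n) with V = ceil(ln(1 / delta))."""
    n_blocks = max(1, math.ceil(math.log(1.0 / delta)))
    if n_blocks > n / 2:
        # The construction requires V <= n / 2, so delta cannot be too small.
        raise ValueError("delta is too small for this sample size")
    return L1_UPPER_BOUND * math.sqrt(variance) * math.sqrt(n_blocks / n)
```

With a fixed confidence level, multiplying the sample size by four halves the radius, as for the Gaussian empirical mean.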

As a first application, let us give the following control for $\mathrm{var}_P f$.

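Corollary 2. Let $(\mathbb{X}, \mathcal{X})$ be a measurable space and let $X_{1:n}$ be i.i.d.,
$\mathbb{X}$-valued random variables with common distribution $P$. Let $f : \mathbb{X} \to \mathbb{R}$ be a
measurable function such that $\mathrm{var}_P(f^2) < \infty$ and let $\delta \in (0, 1)$. Let $V \le n/2$
be an integer and let $\mathcal{B}$ be a partition of $I_n$ satisfying (Creg). Assume that
$V \ge \ln(\delta^{-1})$ and that

(C(f)) $L_1 \sqrt{\frac{\mathrm{var}_P(f^2)}{(P f^2)^2}} \sqrt{\frac{V}{n}} \, \mathbf{1}_{P f^2 \ne 0} \le \frac{1}{2}$.

Then

$\mathbb{P}\{ \mathrm{var}_P(f) \le 2 \overline{P}_{\mathcal{B}} f^2 \} \ge 1 - \delta$.

Proof. We can assume that $P f^2 \ne 0$, otherwise $\mathrm{var}_P(f) = 0$ and the
proposition is proved. We apply Proposition 1 to the function $-f^2$; since
$V \ge \ln(\delta^{-1})$, we have

$\mathbb{P}\left\{ P f^2 - \overline{P}_{\mathcal{B}} f^2 > L_1 \sqrt{\mathrm{var}_P(f^2)} \sqrt{\frac{V}{n}} \right\} \le \delta$.

We conclude the proof with assumption (C(f)), which bounds the deviation
term by $P f^2 / 2$, and the inequality $\mathrm{var}_P(f) \le P f^2$. □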
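The data-driven bound of Corollary 2 is simple to compute. The sketch below is an illustration with hypothetical function names and consecutive blocks; it returns twice the median of the block averages of the squared values, and says nothing about whether (C(f)) holds, which has to be checked separately.

```python
import statistics


def block_median_of_means(values, n_blocks):
    """Median of the averages of `values` over consecutive near-equal blocks."""
    base, extra = divmod(len(values), n_blocks)
    means, start = [], 0
    for k in range(n_blocks):
        size = base + (1 if k < extra else 0)
        means.append(sum(values[start:start + size]) / size)
        start += size
    return statistics.median(means)


def variance_upper_bound(f_values, n_blocks):
    """Twice the robust empirical mean of f^2, where f_values[i] = f(X_i)."""
    return 2.0 * block_median_of_means([v * v for v in f_values], n_blocks)
```

A single very large observation only affects one block, so it leaves the bound unchanged as long as it does not move the median of the block averages.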

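Using a union bound in Proposition 1 and Corollary 2, we get the following
corollary.

Corollary 3. Let $\Lambda$ be a finite set and let $\pi$ be a probability measure
on $\Lambda$. Let $\mathcal{D} = (\psi_\lambda)_{\lambda \in \Lambda}$ be a set of measurable functions. Let $(\mathcal{B}_\lambda)_{\lambda \in \Lambda}$ be a
set of regular partitions of $I_n$ with $\mathrm{Card}(\mathcal{B}_\lambda) = V_\lambda \ge \ln(4(\pi(\lambda)\delta)^{-1})$. Assume
that

(C($\mathcal{D}$)) $\forall \lambda \in \Lambda$, $\mathrm{var}_P(\psi_\lambda^2) < \infty$, and $L_1 \sqrt{\frac{\mathrm{var}_P(\psi_\lambda^2)}{(P \psi_\lambda^2)^2}} \sqrt{\frac{V_\lambda}{n}} \, \mathbf{1}_{P \psi_\lambda^2 \ne 0} \le \frac{1}{2}$.

Then, for $L_2 = \sqrt{2} L_1$, we have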




In Sections 3 and 4, we present several applications of these results. 

3. Application to Lasso estimators in least-squares density esti- 
mation. Lasso estimators of [31] became popular over the last few years, 
in particular because they can be computed efficiently in practice. These 
estimators have been studied in density estimation in [11] for bounded dic- 
tionaries. We propose here to revisit and extend some results of [11] with 
our robust approach. This approach does not require boundedness of the 
dictionary as will be shown. 

Let us recall here the classical framework of density estimation. We
observe i.i.d. random variables $X_{1:n}$, valued in a measured space $(\mathbb{X}, \mathcal{X}, \mu)$,
with common distribution $P$. We assume that $P$ is absolutely continuous
with respect to $\mu$ and that the density $s_\star$ of $P$ with respect to $\mu$ belongs
to $L^2(\mu)$. We denote respectively by $\langle \cdot, \cdot \rangle$ and $\|\cdot\|$ the inner product and the
norm in $L^2(\mu)$. The risk of an estimator $\widehat{s}$ of $s_\star$ is measured by its $L^2$-risk,
i.e. $\|\widehat{s} - s_\star\|^2$. Given a collection $(\widehat{s}_\theta)_{\theta \in \Theta}$ of estimators (possibly random) of
$s_\star$, we want to select a data-driven $\widehat{\theta}$ such that

$\mathbb{P}\left\{ \forall \theta \in \Theta, \ \|\widehat{s}_{\widehat{\theta}} - s_\star\|^2 \le C \|\widehat{s}_\theta - s_\star\|^2 + R(\theta) \right\} \ge 1 - \delta$.

In the previous inequality, the leading constant $C$ is expected to be close to 1
and $R(\theta)$ should remain of reasonable size. In that case, $\widehat{s}_{\widehat{\theta}}$ is said to satisfy
an oracle inequality since it behaves as well as an "oracle", i.e. a minimizer
of $\|s_\star - \widehat{s}_\theta\|^2$ that is unknown in practice.

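Let $\Lambda$ be a finite set of cardinality $M$ and let $\mathcal{D} = \{\psi_\lambda, \lambda \in \Lambda\}$ be a
dictionary, i.e., a finite set of measurable functions. In the case of the Lasso,
$\Theta \subset \mathbb{R}^M$ and, for all $\theta = (\theta_\lambda)_{\lambda \in \Lambda} \in \Theta$,

$s_\theta = \sum_{\lambda \in \Lambda} \theta_\lambda \psi_\lambda$.

The risk of $s_\theta$ is equal to

$\|s_\star - s_\theta\|^2 = \|s_\star\|^2 + \|s_\theta\|^2 - 2 \sum_{\lambda \in \Lambda} \theta_\lambda \int s_\star \psi_\lambda \, d\mu$.

Therefore, the best estimator, or the oracle, in the collection $(s_\theta)_{\theta \in \Theta}$
minimizes the ideal criterion

$\mathrm{crit}_{id}(\theta) = \|s_\theta\|^2 - 2 \sum_{\lambda \in \Lambda} \theta_\lambda P \psi_\lambda$.

The idea of [11] is to replace the unknown $\mathrm{crit}_{id}$ by the following penalized
empirical version

$\mathrm{crit}_{Las}(\theta) = \|s_\theta\|^2 - 2 \sum_{\lambda \in \Lambda} \theta_\lambda P_n \psi_\lambda + 2 \sum_{\lambda \in \Lambda} \omega_\lambda |\theta_\lambda|$,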

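for proper choice of the weights $\omega_\lambda$. We modify this Lasso-type criterion with
our robust version. Let $\mathcal{B}$ be a regular partition of $\{1, \ldots, n\}$ with cardinality
$V$ to be defined later. Let $\overline{P}_{\mathcal{B}}$ be the associated empirical process introduced
in Proposition 1. Our criterion is given by

$\mathrm{crit}(\theta, \mathcal{B}) = \|s_\theta\|^2 - 2 \sum_{\lambda \in \Lambda} \theta_\lambda \overline{P}_{\mathcal{B}} \psi_\lambda + 2 \sum_{\lambda \in \Lambda} \omega_\lambda |\theta_\lambda|$,

for weights $\omega_\lambda$ to be defined later. Our final estimator is then given by $s_{\widehat{\theta}}$,
where

$\widehat{\theta} = \arg\min_{\theta \in \Theta} \{ \mathrm{crit}(\theta, \mathcal{B}) \}$.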
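In one special case the minimizer of the robust criterion has a closed form, which may help to see what the criterion does. Assume, only for this sketch, that the dictionary is orthonormal in $L^2(\mu)$ and that $\Theta = \mathbb{R}^M$ (neither is required by the results of this section). Then $\|s_\theta\|^2 = \sum_\lambda \theta_\lambda^2$, the criterion separates over the coordinates, and each coordinate of the minimizer is the soft-thresholding of the robust coefficient $\overline{P}_{\mathcal{B}} \psi_\lambda$ at level $\omega_\lambda$. All function names below are hypothetical.

```python
import statistics


def block_median_of_means(values, n_blocks):
    """Median of the averages of `values` over consecutive near-equal blocks."""
    base, extra = divmod(len(values), n_blocks)
    means, start = [], 0
    for k in range(n_blocks):
        size = base + (1 if k < extra else 0)
        means.append(sum(values[start:start + size]) / size)
        start += size
    return statistics.median(means)


def soft_threshold(c, w):
    """Minimizer over t of t**2 - 2*t*c + 2*w*abs(t), for w >= 0."""
    if c > w:
        return c - w
    if c < -w:
        return c + w
    return 0.0


def criterion(theta, coeffs, weights):
    """Robust criterion when the dictionary is orthonormal (assumption of this sketch)."""
    return sum(t * t - 2.0 * t * c + 2.0 * w * abs(t)
               for t, c, w in zip(theta, coeffs, weights))


def robust_lasso_orthonormal(sample, dictionary, weights, n_blocks):
    """Soft-threshold the robust empirical coefficients of each dictionary function."""
    coeffs = [block_median_of_means([psi(x) for x in sample], n_blocks)
              for psi in dictionary]
    return [soft_threshold(c, w) for c, w in zip(coeffs, weights)]
```

For a general dictionary the criterion is still convex in $\theta$, but no closed form is available and a numerical solver would be needed; that case is not covered by this sketch.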

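For all $\theta \in \mathbb{R}^M$, let $J(\theta) = \{\lambda \in \Lambda, \ \theta_\lambda \ne 0\}$, $M(\theta) = \mathrm{Card}\, J(\theta)$. Let $\delta \in
(0, 1)$ and, for every $\lambda, \lambda' \in \Lambda$, let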

/X (^A,^A') \ ^\ \'\\ 

'''^ '^^¥J¥hAy maxmaxMA,A)| , 

AgJ(6I) A'>A AgJ(6») 



In(^) AeJ(9) i IIV-aII j ' V n x&A { ujx 

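Let us call $\Psi_M$ the Gram matrix of $\mathcal{D}$, i.e., the matrix with entries $\rho_M(\lambda, \lambda')$,
and let $\zeta_M$ be the smallest eigenvalue of $\Psi_M$. The following assumptions
were used in [11] to state the results.

(H1($\theta$)) $16 G F(\theta) M(\theta) \le 1$.

(H2($\theta$)) $16 G F(\theta) \rho_*(\theta) \sqrt{M(\theta)} \le 1$.

(H3($\theta$)) $\Psi_M > 0$ and $\zeta_M \ge \kappa_M > 0$.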

This estimator satisfies the following theorem.

Theorem 4. Let $\delta \in (0, 1)$, let $\mathcal{D} = (\psi_\lambda)_{\lambda \in \Lambda}$, let $\mathcal{B}$ be a regular partition
of $\{1, \ldots, n\}$, with $V \ge \ln(4M\delta^{-1})$. Assume that (C($\mathcal{D}$)) holds for $\pi$ the
uniform distribution on $\Lambda$ and, for all $\lambda \in \Lambda$, $V_\lambda = V$. If (H1($\theta$)), (H2($\theta$))
or (H3($\theta$)) hold, the estimator $s_{\widehat{\theta}}$ defined with $L_3 = 2 L_2$ and

$\omega_\lambda \ge L_3 \sqrt{\overline{P}_{\mathcal{B}} \psi_\lambda^2} \sqrt{\frac{V}{n}}$

satisfies, for all $\alpha > 1$, with probability larger than $1 - 2\delta$, $\forall \theta \in \Theta$,
i2 a 



M . ^-^11 +^jr— TTzZ'^^ r^-^^ - — 7 Ne-s.r + — • 

^ ' AeA 

where the remainder term $R(\theta)$ is equal to $F^2(\theta) M(\theta) \ln(4M\delta^{-1})/n$ under
(H1($\theta$)) or (H2($\theta$)) and to $G(\theta)/(n \kappa_M)$ under (H3($\theta$)).

Remark 2. The proof of Theorem 4 is decomposed into two propositions.
The main one, Proposition 5 below, follows from the proofs of
Theorems 1, 2 and 3 of [11], which will not be reproduced here. We refer to this
paper for the proof and for further comments on the main theorem. Let us
remark that the improvement that we get using our robust approach is that
we only require $P \psi_\lambda^4 < \infty$ in our results whereas in [11], it is required that

Remark 3. An interesting feature of our result is that it allows one to revisit
famous procedures built with the empirical process: $P_n$ has to be replaced
by $\overline{P}_{\mathcal{B}}$ for a proper choice of $V$, and Bennett's, Bernstein's or Hoeffding's
inequalities can be replaced by Proposition 1. Theorem 4 is just an example
of this general principle.

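Proposition 5. Under assumptions (H1($\theta$)), (H2($\theta$)) or (H3($\theta$)), the
following condition

(2) $\forall \theta \in \Theta, \quad \|s_{\widehat{\theta}} - s_\star\|^2 + \sum_{\lambda \in \Lambda} \omega_\lambda |\widehat{\theta}_\lambda - \theta_\lambda| \le \|s_\theta - s_\star\|^2 + 4 \sum_{\lambda \in J(\theta)} \omega_\lambda |\widehat{\theta}_\lambda - \theta_\lambda|$

implies, for all $\alpha > 1$, $\forall \theta \in \Theta$,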

a + 1 ,,2 8a 



ii2 a v-^ K n ^ ct + i ,,2 oa" 

^4 +^7^E-aK-^a <^^\\se-s.f + ^^Ri9) . 



AeA

In order to prove Theorem 4, we only have to ensure that Condition (2)
holds with the required probability, which will be done in the following
proposition.

Proposition 6. Let $\delta \in (0, 1)$. Let $\mathcal{B}$ be a regular partition of $\{1, \ldots, n\}$,
with $V \ge \ln(4M\delta^{-1})$ and assume that (C($\mathcal{D}$)) holds. Assume that,

(3)

Then, with probability larger than $1 - \delta$, $\forall \theta \in \Theta$,

$\|s_{\widehat{\theta}} - s_\star\|^2 + \sum_{\lambda \in \Lambda} \omega_\lambda |\widehat{\theta}_\lambda - \theta_\lambda| \le \|s_\theta - s_\star\|^2 + 4 \sum_{\lambda \in J(\theta)} \omega_\lambda |\widehat{\theta}_\lambda - \theta_\lambda|$.

Proposition 6 is proved in Section A.2.

4. Estimator Selection. Estimator selection is a new theory developed
in [3, 5]. It covers important statistical problems such as model selection,
aggregation of estimators and the selection of tuning constants in statistical
procedures, and it allows one to compare several estimation procedures. [3]
developed a general approach in the spirit of [8]. It applies in various
frameworks but it does not provide a method for practitioners because the
resulting estimators are too hard to compute. On the other hand, [5] worked
in a Gaussian regression framework and obtained efficient estimators. The
approach of [5] can be adapted to the density estimation framework as we will
see. In order to keep the paper of reasonable size, we will not give practical
ways to define the collections of estimators. This fundamental issue and
several others are extensively discussed in [5]; we refer to this paper for all
the practical consequences of the main oracle inequality. As [5] worked in a
Gaussian framework, they do not require any $L^\infty$-norm. In order to emphasize
the advantages of robust empirical mean estimators in this problem, let us
first extend the results of [5] to the density estimation framework where
conditions on $L^\infty$-norms are classical [6, 9, 23].

4.1. The empirical version. Let $(\widehat{s}_\theta)_{\theta \in \Theta}$ be a collection of estimators of
$s_\star$. Let $(S_m)_{m \in \mathcal{M}}$ be a collection of linear subspaces of measurable functions
and for all $\theta \in \Theta$, let $\mathcal{M}_\theta$ be a subset of $\mathcal{M}$, possibly random. For all $\theta \in \Theta$
and all $m \in \mathcal{M}_\theta$, we choose an estimator $\widehat{s}_{\theta,m}$, for example, the orthogonal
projection of $\widehat{s}_\theta$ onto $S_m$. Finally, we denote by pen : $\mathcal{M} \to \mathbb{R}^+$ a function
to be defined later. Let $\alpha > 0$ and let

$\mathrm{crit}_\alpha(\theta) = \min_{m \in \mathcal{M}_\theta} \left\{ \|\widehat{s}_{\theta,m}\|^2 - 2 P_n \widehat{s}_{\theta,m} + \alpha \|\widehat{s}_{\theta,m} - \widehat{s}_\theta\|^2 + \mathrm{pen}(m) \right\}$,

(4) $\widehat{\theta} = \arg\min_{\theta \in \Theta} \mathrm{crit}_\alpha(\theta)$.

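Let $\mathcal{P}(\mathcal{M})$ be the set of probability measures on $\mathcal{M}$, let $B(m)$ be the unit ball
in $L^2$-norm of $S_m$, $B(m) = \{t \in S_m, \ \|t\| \le 1\}$ and let $\Psi_m = \sup_{t \in B(m)} t^2$.

Let $m_o$ be a minimizer of $n \|s_\star - s_m\|^2 + P\Psi_m$. Let us introduce the following
assumption.

$\exists \pi \in \mathcal{P}(\mathcal{M})$, $\varepsilon_n \to 0$, $\delta_o \in (0, 1)$ s.t.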

(CSED) ym^M, "''""'"^°"°"^(^ <e„. 

The following result holds. 

Theorem 7. Let $X_{1:n}$ be i.i.d., $\mathbb{X}$-valued random variables with common
density $s_\star \in L^2(\mu)$. For all $m \in \mathcal{M}$, let $s_m$ be the orthogonal projection
of $s_\star$ onto $S_m$ and assume that (CSED) holds. Let $\delta \ge \delta_o$, let $\varepsilon'_n =
4\sqrt{\varepsilon_n} + \varepsilon_n/3$ and let $n_o$ be such that, for all $n \ge n_o$, $\varepsilon'_n \le 1/2$. Let also



^ll^mlLln 



7r(m)5 



n 



Let $\widehat{\theta}$ be the estimator (4) selected by a penalty pen such that, for some
$\nu \in (0, 1)$,

5 , \ -P^m , 2Lo\\s\\ , 2Lo 2 



pen(m) > - + 2Loz^ \ rmiS) H o-r„,((5) , 

\2 J n V 

where $L_o \le 16(\ln 2)^{-1} + 8$. There exists a constant $L_\alpha$ such that, for all
$n \ge n_o$, with probability larger than $1 - \delta$, $\forall \theta \in \Theta$,

(5) La\\s^-sJ\ <\\s0-Si,\\ + min <^ „ - sqU +2pen(m)^ . 

Theorem 7 is proved in Section A. 3. 

Remark 4. It is shown in [22] that (CSED) typically holds in classical 
collections of models as Fourier spaces, and growing wavelet or histogram 
spaces, under reasonable assumptions on the risk of the estimator, for 6o = 
0{n~'^). We refer to this paper for further details on this assumption.

Remark 5. It is usually assumed that, for some constant $\Gamma$, $\|\Psi_m\|_\infty \le
\Gamma P\Psi_m$. In that case, for those $m$ such that $P\Psi_m \to \infty$, we have $r_m(\delta) =
o(n^{-1} P\Psi_m)$ so that the condition on the penalty is asymptotically satisfied
as soon as $\mathrm{pen}(m) > 2.5 P\Psi_m/n$. This last condition holds if we choose
$\mathrm{pen}(m) > 2.5 \|\Psi_m\|_\infty/n$ for example. A first application of our robust
approach is that, using Corollary 3, we can choose $\mathrm{pen}(m) > 5 \overline{P}_{\mathcal{B}} \Psi_m/n$ under
the more reasonable assumptions (C($\mathcal{D}$)) with $\mathcal{D} = (\Psi_m)_{m \in \mathcal{M}}$. In that case,
the condition on the penalty only holds with large probability but the
interested reader can check that this will only change $\delta$ into $2\delta$ in (5).

Remark 6. This result includes as a special case the classical model
selection framework of [6, 9] where $\Theta = \mathcal{M}$ and, for all $m \in \mathcal{M}$, $\widehat{s}_m$ is the
projection estimator onto $S_m$.

Remark 7. It also covers the problem of choosing a tuning parameter
in a statistical estimation method. In that case $\Theta$ is usually a subset of $\mathbb{R}$
and $\widehat{s}_\theta$ is the estimator selected by the statistical method, with the tuning
parameter equal to $\theta$. It also allows one to mix several estimation strategies; in
that case $\Theta$ is typically the product of a finite set $\mathcal{A}$ describing the set of
methods with a subset of $\mathbb{R}$ or $\mathbb{R}^d$ describing the possible values of the
tuning parameters (see [5]).

4.2. A robust version. We have already shown in remark 5 that our ro- 
bust approach can be used to build the penalty term in the "classical" esti- 
mator selection described above. However, this first approach relies on the 
assumptions that ||^m|loo — ^P^m-, \\sm — Sm'lloo — ^ W^rn — Sm'll- We will 
now present another approach, totally based on robust estimators, which 
works without these assumptions. In this section, is an orthonor- 

mal system in L^(^) and Aj\/ C A is a finite subset. Let {^m)m^M be a 
collection of subsets oi Km, let A„ = Umg»Am, and for all m E A^, let 5^ 
the linear span of ('(/'a)agA„i- Let {se)e<^@ be a collection of estimator of s*. 
For all G 0, let J^e be a subset of 7W, possibly random. For all G and 
all m G Me, we choose an estimator sg^rn, for example, the orthogonal pro- 
jection of 'sq onto Sm- For all G 0, for all m G Me and for all A G A^, let 
= {se^m, ^a)- For all A G A„, let B\ be a regular partition of { 1, . . . , n}, 
with cardinality Vx to be defined later. Let pen : Ai — )■ a function to be 
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defined later. Let a > and let 



cnta{9) = min < 



|se,mf -2 ^ ^^PH^'0A + ap6»,m-S6if + pen(m) 
AeAm 



(6) 6 = aremincritrvf^) . 

This estimator satisfies the following theorem. 

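Theorem 8. Let $X_{1:n}$ be i.i.d., $\mathbb{X}$-valued random variables with common
density $s_\star \in L^2(\mu)$. Let $\delta \in (0, 1)$ and $\varepsilon \in (0, 1/4)$. Let $(\widehat{s}_\theta)_{\theta \in \Theta}$ be a
collection of estimators, let $(\psi_\lambda)_{\lambda \in \Lambda}$, $\Lambda_M$, $(\Lambda_m)_{m \in \mathcal{M}}$, $(S_m)_{m \in \mathcal{M}}$, $(\mathcal{M}_\theta)_{\theta \in \Theta}$
and $(\widehat{s}_{\theta,m})_{\theta \in \Theta, m \in \mathcal{M}_\theta}$ be defined as above. Let $\pi$ be a probability measure on
$\Lambda_M$. Assume that, for all $\lambda \in \Lambda_M$, $V_\lambda \ge \ln(2(\pi(\lambda)\delta)^{-1})$. Then, the estimator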
defined in (6) with pen(m) > X^^eA varp {'>px)Vx, where L4 = 9Lf/4 
satisfies, with probability larger than 1 — 5/2, \/6 G 0, 

(1 -4e) Aa ||2 ^ ||2 ^ ■ f ^ 1,2 , . ^\ 

Theorem 8 is proved in Section A. 4. 

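Remark 8. In order to choose the penalty term, one can use Corollary
3. Under assumption (C($\mathcal{D}$)) for the dictionary $(\psi_\lambda)_{\lambda \in \Lambda_M}$, we have, with our
choice of $V_\lambda$,

$\mathbb{P}\left\{ \forall \lambda \in \Lambda_M, \ \mathrm{var}_P(\psi_\lambda) \le 2 \overline{P}_{\mathcal{B}_\lambda} \psi_\lambda^2 \right\} \ge 1 - \frac{\delta}{2}$;

we can therefore use the data-driven penalty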

2 T 

pen(m) = ^ E (^^5.V'!)^A • 

AeAm 

Remark 9. Compared to Theorem 7, we see that the collection of models
$(S_m)_{m \in \mathcal{M}}$ is restricted here to a family generated by an orthonormal
system $(\psi_\lambda)_{\lambda \in \Lambda_M}$. The penalty term is in general heavier, which yields a loss
(of order $\sup_{\lambda \in \Lambda_M} V_\lambda$) in the convergence rates. On the other hand, we do
not require $\|\Psi_m\|_\infty$ to be finite, we only need a finite moment of order 4
for the functions $\psi_\lambda$. Moreover, in order to build a data-driven penalty in
Theorem 7, we asked that $\|\Psi_m\|_\infty \le \Gamma P\Psi_m$, whereas we only require
now a bound on $\sqrt{P\psi_\lambda^4}/P\psi_\lambda^2 \, \mathbf{1}_{P\psi_\lambda^2 \ne 0}$ of order smaller than $\sqrt{n/V_\lambda}$. In order
to emphasize the difference between these conditions, let us consider
the case of a histogram, where for some disjoint measurable sets $(I_\lambda)_{\lambda \in \Lambda_m}$,
$\psi_\lambda = (\mu(I_\lambda))^{-1/2} \mathbf{1}_{I_\lambda}$. In that case, we have

IhT, II 1 D/2 /r> i4 VPh 

= ^.Z Khv "^^^ = wry V = ^ • 

If s-^: is upper bounded by 0+, we deduce that < c+dm-, where dm is 

the number of pieces of the histogram. If, moreover, is lower bounded by 
c_, we have 

1 

The condition of Theorem 8 is therefore satisfied as soon as, for some con- 
stant r sufficiently large, fJ.{I\) > r~^nV^~^ whereas the condition of Theo- 
rem 7 holds only if i^{Ix) » d^. This last condition is much more restric- 
tive when the histograms are irregular. 

Remark 10. Important collections $(\psi_\lambda)_{\lambda \in \Lambda}$ are the wavelet spaces, used
for example in [18]. It is shown for example in [9] that the hard thresholded
estimator of [18] coincides with the estimator chosen by model selection with
the penalty $d_m t(n)$, where $t(n)$ is the threshold. Moreover, the soft
thresholded estimator coincides with the Lasso estimator presented in the previous
section (see for example [11] for details). Our estimator selection procedure
can be used with wavelet estimators; it allows one to select from the data the
best strategy, together with the best thresholds.

5. Robust Estimators in M-estimation. The rest of the paper is
devoted to the study of robust estimators in a more general context of
M-estimation. These estimators will be defined using a slightly more elaborate
construction than the one presented in Section 2. This general principle will
then be applied to classical problems such as density estimation and regression.

5.1. The general case. Hereafter, $(\mathbb{X}, \mathcal{X})$ denotes a measurable space,
$\gamma : \mathcal{S} \to \mathbb{R}$ is a measurable function, called contrast, and $P$ is a probability
measure. We want to estimate the target $s_\star = \arg\min_{t \in \mathcal{S}} P\gamma(t)$ based on
the observation of i.i.d. random variables $X_{1:n} = (X_1, \ldots, X_n)$ with common
distribution $P$. Let $8 \le V \le n/2$ and let $\mathcal{B}$ be a regular partition of
$\{1, \ldots, n\}$, with cardinality $V$. Let $X_{B_K} = (X_i)_{i \in B_K}$ and let $S$ be a subspace
of $\mathcal{S}$. Let $\widehat{s}_K$ be any estimator defined as a function $\widehat{s}_K = F(X_{B_K})$ valued
in $S$. For all $K, K' = 1, \ldots, V$ and for all functions $t$, let

$\overline{P}_{K,K'} t = \mathrm{Med}\{ P_{B_J} t, \ J \ne K, K' \}$.

Our final estimator is defined as

(7) $\widehat{s} = \widehat{s}_{K_\star}$, where $K_\star = \arg\min_{K = 1, \ldots, V} \max_{K' = 1, \ldots, V} \left\{ \overline{P}_{K,K'} \left( \gamma(\widehat{s}_K) - \gamma(\widehat{s}_{K'}) \right) \right\}$.
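The selection rule (7) can be sketched as follows. This is an illustration under our own conventions, not a reference implementation: the function name and the signature `contrast(t, x)`, which stands for $\gamma(t)(x)$, are hypothetical, blocks are lists of 0-based indices, and ties in the arg min are broken by taking the first index.

```python
import statistics


def select_block_estimator(data, blocks, estimators, contrast):
    """Return the index K minimizing the max over K' of the robust contrast difference.

    estimators[k] is the estimator built on block k; contrast(t, x) plays the
    role of gamma(t)(x). For each pair (K, K'), the median is taken over the
    blocks J different from K and K'.
    """
    v = len(blocks)
    if v < 3:
        raise ValueError("at least three blocks are needed")

    def block_average(j, k, kp):
        block = blocks[j]
        return sum(contrast(estimators[k], data[i]) - contrast(estimators[kp], data[i])
                   for i in block) / len(block)

    def worst_case(k):
        worst = 0.0  # the term K' = K: the difference is identically zero
        for kp in range(v):
            if kp == k:
                continue
            others = [block_average(j, k, kp) for j in range(v) if j not in (k, kp)]
            worst = max(worst, statistics.median(others))
        return worst

    return min(range(v), key=worst_case)
```

As a usage example, mean estimation with the quadratic contrast corresponds to `contrast = lambda t, x: (x - t) ** 2` and `estimators` equal to the block averages; a block estimator corrupted by an outlier then loses the comparison against the others on the remaining blocks.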

The risk of an estimator $\widehat{s}$ is measured with the excess risk

$\ell(\widehat{s}, s_\star) = P(\gamma(\widehat{s}) - \gamma(s_\star))$.

We denote by $s_o = \arg\min_{t \in S} P\gamma(t)$. We assume the following margin type
condition.

3iV G N, (ai)i=i„„,^ < 1, iaf)i=o,...,N < oo, e < 1,^ withP(a) > 1 - e, 
s.t. on a, max||var[(7(?i^)-7(so))(X)| 

N 

(CMarg) < al£isK, Sof + a^£{sK, Sof'^'. 

1=1 

Remark 11. Assumption (CMarg) is not classical and might surprise 
at first sight. It will be discussed in Section 5.2. In particular, we show in 
this section that (CMarg) holds under few assumptions on the data and the 
space S in density estimation and in a general not bounded, heteroscedastic, 
non Gaussian and random design regression framework. 

We finally denote, for all real numbers $a$, by $\lceil a \rceil$ the smallest integer $b$ such
that $b \ge a$. Our result is the following.

Theorem 9. Let $X_{1:n}$ be i.i.d. random variables and let $\delta > 0$ be such
that $\lceil \ln(\delta^{-1}) \rceil \le n/2$. Let $\mathcal{B}$ be a regular partition of $\{1, \ldots, n\}$, with $V =
\lceil \ln(\delta^{-1}) \rceil \vee 8$. Let $(\widehat{s}_K)_{K = 1, \ldots, V}$ be a sequence of estimators satisfying (CMarg)
and let s/^^ be the associated estimator defined in (7). Let Co = Li and for 
alli = l,...,N, /et = 4(1 - ai)(Lia"OV(i-".). For all A > 1, let 

u,,{A) = Coao^ + ^, R,.,{A) = Y,a(^A^'<^^^|^y 
For all $A > 1$, with probability larger than $1 - \delta - \varepsilon$,

$(1 - \nu_n(A)) \, \ell(\widehat{s}_{K_\star}, s_o) \le (1 + 3\nu_n(A)) \inf_K \{ \ell(\widehat{s}_K, s_o) \} + R_n(A)$.

In particular, for all $A$, $n$ such that $\nu_n(A) \le 1/2$, we have, with probability
larger than $1 - \varepsilon - \delta$,

$\ell(\widehat{s}_{K_\star}, s_\star) \le \ell(s_o, s_\star) + (1 + 8\nu_n(A)) \inf_K \{ \ell(\widehat{s}_K, s_o) \} + 2 R_n(A)$.

Theorem 9 is proved in Section B.1.

Remark 12. Assume that, for all $K$, $\widehat{s}_K = \arg\min_{t \in S} P_{B_K}\gamma(t)$ and let $\widehat{s}_n$
denote the classical empirical risk minimizer, $\widehat{s}_n = \arg\min_{t \in S} P_n\gamma(t)$. $\widehat{s}_n$ is
known to have a nice behavior when the contrast $\gamma$ is bounded and under
some margin condition such as (see [25, 27] and the references therein),

$\exists \alpha \in [0, 1]$ and $A \ge 1$, s.t. $\forall t \in S$, $\mathrm{var}_P(\gamma(t) - \gamma(s_o)) \le A \, \ell(t, s_o)^\alpha$.

Our approach does not require a finite sup norm for $\gamma$. However, it should
be mentioned that the confidence level $\delta$ has to be chosen in advance and
that we lose a log factor in the expectation, as the $\widehat{s}_K$ are built with only
about $n/V$ data.

5.2. Application to classical statistical problems. The aim of this section 
is to apply Theorem 9 in some well known problems. We show in particular 
that Condition (CMarg) holds in these frameworks. 

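5.2.1. Density estimation with $L^2$-loss. Assume that $X_{1:n}$ have common
marginal density $s_\star$. Assume that $s_\star \in L^2(\mu)$. Then we have, for all $t$ in
$L^2(\mu)$,

$\|s_\star - t\|^2 = \|s_\star\|^2 + \|t\|^2 - 2 \int t s_\star \, d\mu = \|s_\star\|^2 + \|t\|^2 - 2 P t = \|s_\star\|^2 + P\gamma(t)$.

In the previous equality, $\gamma(t) = \|t\|^2 - 2t$, thus $s_\star = \arg\min_{t \in L^2(\mu)} P\gamma(t)$.
We also have

$P\gamma(s_\star) = \|s_\star\|^2 - 2 \int s_\star^2 \, d\mu = -\|s_\star\|^2$,

hence,

$\ell(t, s_\star) = \|t\|^2 - 2 \int t s_\star \, d\mu + \|s_\star\|^2 = \|t - s_\star\|^2$.

Let $S$ be a linear space of measurable functions and let

$\widehat{s}_K = \arg\min_{t \in S} P_{B_K}\gamma(t)$.

$\widehat{s}_K$ is easily computed since, for all orthonormal bases $(\psi_\lambda)_{\lambda \in \Lambda}$ of $S$, we have

$\widehat{s}_K = \sum_{\lambda \in \Lambda} (P_{B_K}\psi_\lambda) \psi_\lambda$.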
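As an illustration of how the block estimators can be computed, the following sketch takes for $S$ the space of regular histograms with $d$ bins on $[0, 1)$. The choice of this space and the function names are assumptions of ours; the orthonormal basis is $\psi_j = \sqrt{d} \, \mathbf{1}_{[j/d, (j+1)/d)}$ for the Lebesgue measure.

```python
import math


def histogram_basis(d):
    """Orthonormal basis of the regular histogram space with d bins on [0, 1)."""
    def make(j):
        # psi_j equals sqrt(d) on the j-th bin and 0 elsewhere.
        return lambda x: math.sqrt(d) if j / d <= x < (j + 1) / d else 0.0
    return [make(j) for j in range(d)]


def block_projection_estimator(block_sample, basis):
    """Least-squares estimator on one block: sum of (block average of psi) * psi."""
    coeffs = [sum(psi(x) for x in block_sample) / len(block_sample) for psi in basis]
    return lambda x: sum(c * psi(x) for c, psi in zip(coeffs, basis))
```

On each bin the resulting estimator equals $d$ times the fraction of the block falling in that bin, which is the usual histogram density estimator computed on the block only.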

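Let us denote by $s_o$ the orthogonal projection of $s_\star$ onto $S$. We have, from
Pythagoras relation,

$\|\widehat{s}_K - s_\star\|^2 = \|\widehat{s}_K - s_o\|^2 + \|s_o - s_\star\|^2$.

Given an orthonormal basis $(\psi_\lambda)_{\lambda \in \Lambda}$ of $S$, we have

$\mathbb{E}\left( \|\widehat{s}_K - s_o\|^2 \right) = \mathbb{E}\left( \sum_{\lambda \in \Lambda} \left( (P_{B_K} - P)\psi_\lambda \right)^2 \right) = \frac{\sum_{\lambda \in \Lambda} \mathrm{var}(\psi_\lambda(X_1))}{|B_K|} = \frac{P\Psi - \|s_o\|^2}{|B_K|}$.

In the previous equality, the function $\Psi$ is equal to

$\Psi = \sum_{\lambda \in \Lambda} \psi_\lambda^2 = \sup_{t \in B} t^2$, where $B = \{t \in S, \ \|t\| \le 1\}$.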
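The identity between the two expressions of $\Psi$ follows from the Cauchy-Schwarz inequality. A short derivation, written under the assumption (ours) that $S$ has finite dimension so that $\Psi(x)$ is finite for every $x$, is the following.

```latex
% Any t in B can be written t = \sum_\lambda a_\lambda \psi_\lambda with
% \|t\|^2 = \sum_\lambda a_\lambda^2 \le 1. For every x, Cauchy-Schwarz gives
\[
  t(x)^2
  = \Bigl( \sum_{\lambda \in \Lambda} a_\lambda \psi_\lambda(x) \Bigr)^2
  \le \Bigl( \sum_{\lambda \in \Lambda} a_\lambda^2 \Bigr)
      \Bigl( \sum_{\lambda \in \Lambda} \psi_\lambda(x)^2 \Bigr)
  \le \Psi(x).
\]
% Conversely, if \Psi(x) > 0, the choice a_\lambda = \psi_\lambda(x) / \sqrt{\Psi(x)}
% defines an element t_x of B such that
\[
  t_x(x)^2
  = \Bigl( \sum_{\lambda \in \Lambda} \frac{\psi_\lambda(x)^2}{\sqrt{\Psi(x)}} \Bigr)^2
  = \Psi(x),
\]
% hence \sup_{t \in B} t(x)^2 = \Psi(x); the case \Psi(x) = 0 is trivial.
```

In particular $\Psi$ does not depend on the choice of the orthonormal basis.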

The following proposition holds. 

Proposition 10. Let Xi-n be random variables with common marginal 
density with respect to the Lebesgue measure /u. Assume that G L^(;u). 
Let S be a linear space of measurable functions. Let V be an integer and let 
Bi, . . . , By be a regular partition o/{ 1, . . . , n}. For all K = 1, . . . ,V , let 'sk 
be any estimator taking value in S and measurable with respect to a{XBj^)- 
Then Condition (CMarg) is satisfied with N = 1, uq = Q, ai = 1/2, 

Proof. We have ^(sr) = — 2sx; hence, 

var(7(si^(X)) -7(so)| ) = 4 var ( (si^- - So)(^) | Xb^) ■ 

We can write 'sk = X^agA ^a'^a, with measurable with respect to cr(XB^ ) . 
From Cauchy-Schwarz inequality, using the independence between X and 
the Xi, 



var {{sK - So){X) \ Xb^) 



E 5](af -P^a)(^a(X)-PVa) 




aga 



Vaga 

= \\SK - SoW'^ (^P'^ - ||So||^ 

Therefore, (CMarg) holds with N = 1, ao = 0, ai = 1/2, af = P'i' - 

II l|2 1-1 
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We can deduce from Theorem 9 the following result. 

Proposition 11. Let $X_{1:n}$ be i.i.d. random variables with common
density $s_\star$ with respect to the Lebesgue measure $\mu$. Assume that $s_\star \in L^2(\mu)$. Let
$\delta > 0$ be such that $\lceil \ln(\delta^{-1}) \rceil \le n/2$. Let $\mathcal{B}$ be a regular partition of $\{1, \ldots, n\}$,
with $V = \lceil \ln(\delta^{-1}) \rceil \vee 8$. $\forall K = 1, \ldots, V$, let $\widehat{s}_K = \arg\min_{t \in S} P_{B_K}\gamma(t)$
and let $\widehat{s}_{K_\star}$ be the associated estimator defined in (7). We have, for $L_5 =$
2Ve + 8Lle^/^ 



\SK,-Si,f > ||so-s^f + L5 f P^r- llsof ] < 2S . 

Remark 13. A classical density estimator is the minimizer of the
empirical risk,

$\widehat{s}_m = \arg\min_{t \in S} P_n\gamma(t)$.

This estimator satisfies the following risk bound (see for example [23]), with
probability larger than $1 - \delta$, for all $\varepsilon > 0$,

n \ ne e'^n^ 

In the last inequality = sup^g^ P{{t - Ptf) < P^! - ||sof , &^ = ll^llc^,- 
The empirical risk minimizer is better when the bound $b^2 \le n(P\Psi - \|s_o\|^2)$
holds. However, our approach does not require that the sup-norm of the
function $\Psi$ is bounded, we only need a finite expectation.

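Proof. Since (CMarg) holds, it follows from Theorem 9 that, for all
$A \ge 2$, the estimator (7) satisfies, with probability larger than
$1 - \delta$,

(8) $$\|\hat s_{K_\star} - s_\star\|^2 \le \|s_0 - s_\star\|^2
   + \Big(1 + \frac{8}{A}\Big)\inf_{K}\|\hat s_K - s_0\|^2
   + L_1^2 A\,(P\Psi - \|s_0\|^2)\,\frac{V}{n}.$$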

Moreover, as $E(\|\hat s_K - s_0\|^2) = (P\Psi - \|s_0\|^2)/|B_K|$, by
regularity of the partition, we deduce that

$$E(\|\hat s_K - s_0\|^2) \le 2\,(P\Psi - \|s_0\|^2)\,\frac{V}{n}.$$

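Hence, from (8) and Lemma 26, with probability larger than $1 - 2\delta$,

$$\|\hat s_{K_\star} - s_\star\|^2 \le \|s_0 - s_\star\|^2
   + \Big(2\sqrt{e} + \frac{16\sqrt{e}}{A} + L_1^2 A\Big)
     (P\Psi - \|s_0\|^2)\,\frac{V}{n}.$$

We conclude the proof, choosing $A = 4e^{1/4}/L_1$. □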
5.2.2. Density estimation with Kullback loss. We observe $X_{1:n}$ with
density $s_\star$ with respect to the Lebesgue measure $\mu$ on $[0, 1]$.
We denote by $\mathcal C$ the space of positive functions $t$ such that
$P|\ln t| < \infty$. We have

$$s_\star = \arg\min_{t\in\mathcal C}\{-P\ln(t)\}, \quad\text{hence}\quad
  \gamma(t) = -\ln t.$$

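Let $S$ be a space of histograms on $[0, 1]$, i.e. there exists a partition
$(I_\lambda)_{\lambda\in\Lambda}$ of $[0, 1]$, where, for all
$\lambda \in \Lambda$, $\mu(I_\lambda) \ne 0$, such that all functions $t$ in
$S$ can be written $t = \sum_{\lambda\in\Lambda} a_\lambda 1_{I_\lambda}$.
We denote by

$$s_0 = \arg\min_{t\in S}\{-P\ln t\}, \quad\text{and}\quad
  \forall K = 1, \ldots, V,\ \ \hat s_K = \arg\min_{t\in S} -P_{B_K}\ln t.$$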

It comes from Jensen's inequality that 

aga ^ ^ ^' AeA ^ 

Therefore, for all = 1, . . . , F 



For all $K = 1, \ldots, V$, the estimator $\hat s_K$ has a finite Kullback
risk only on the event

$$\Omega_K = \{\forall \lambda \in \Lambda \text{ such that } PI_\lambda > 0,\
   P_{B_K}I_\lambda > 0\}.$$

In order to avoid this problem, we choose $x > 0$ and define, for all
$K = 1, \ldots, V$, the estimator

(9) tK = 

1 + X 

This way, $\tilde s_K$ is always non-null and the Kullback risk of
$\tilde s_K$ is always finite. Finally, we denote by $C_{reg}(S)$ a constant
such that

(CR) $$\min_{\lambda\in\Lambda \text{ s.t. } PI_\lambda \ne 0} PI_\lambda
   \ge C_{reg}^{-1}(S).$$
The following result ensures that (CMarg) holds. 

Proposition 12. Let $X_{1:n}$ be random variables with density $s_\star$ with
respect to the Lebesgue measure $\mu$ on $[0, 1]$. Let $S$ be a space of
histograms on $[0, 1]$. Let $V \le n$ be an integer and let $\mathcal B$ be a
regular partition of $\{1, \ldots, n\}$. For all $K = 1, \ldots, V$, let
$\tilde s_K$ be the estimator defined in (9). Then, (CMarg) holds with
$\sigma_0 = 0$, $N = 1$, $\alpha_1 = 1/2$ and
$\sigma_1^2 = 2 + 3\ln(1 + x^{-1})$.
Proposition 12 is proved in Section B.2.

Remark 14. As for the $L^2$-loss, no assumptions on the observations are
required to check (CMarg).

Proposition 13. Let $X_{1:n}$ with density $s_\star$ with respect to the
Lebesgue measure $\mu$ on $[0, 1]$. Let $S$ be a space of histograms on
$[0, 1]$ with finite dimension $D$ and let $C_{reg}(S)$ be a constant
satisfying condition (CR). Let $\delta > 0$ such that
$\lceil\ln(\delta^{-2})\rceil \le n/2$. Let $\mathcal B$ be a regular
partition of $\{1, \ldots, n\}$, with
$V = \lceil\ln(\delta^{-2})\rceil \vee 8$. For all $K = 1, \ldots, V$, let
$\tilde s_K$ be the estimator defined
in (9), with x = and let 'sk-^ be the associated estimator defined in (7). 
Then, for some absolute constant Lq, 

F [l{sK.,s.) > i{s., s.) + L,£M^ ( 1 + ^ ) ) < 25 . 

Proposition 13 is proved in Section B.3. 

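5.2.3. Regression with $L^2$-risk. Let $(X_i, Y_i)_{i=1,\ldots,n}$ be
independent copies of a pair of random variables $(X, Y)$ satisfying the
following equation

$$Y = s_\star(X) + \sigma(X)\epsilon, \quad\text{with } E(\epsilon|X) = 0,\
  E(\epsilon^2|X) = 1,\ \sigma, s_\star \in L^2(P_X).$$

Let $t$ in $L^2(P_X)$, we have

$$E((Y - t(X))^2) = E\big(E((Y - t(X))^2 \mid X)\big)
  = E\big((s_\star(X) - t(X))^2 + \sigma^2(X)\big).$$

Hence, $s_\star = \arg\min_{t\in L^2(P_X)} P\gamma(t)$, with
$\gamma(t)(X, Y) = (Y - t(X))^2$. Moreover, we have

$$\ell(t, s_\star) = E\big((s_\star(X) - t(X))^2 + \sigma^2(X)\big)
  - E(\sigma^2(X)) = E\big((s_\star(X) - t(X))^2\big).$$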
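The identity above can be checked exactly on a small discrete model. The sketch below is only an illustration under assumptions of ours: $X$ is uniform on a finite set, $\epsilon$ is uniform on $\{-1, +1\}$ (so that $E(\epsilon|X) = 0$ and $E(\epsilon^2|X) = 1$), and the functions `s_star`, `sigma` and `t` are hypothetical choices, not objects from the paper.

```python
from fractions import Fraction
from itertools import product

def risk(t, s_star, sigma, xs, eps_values):
    """Exact E[(Y - t(X))^2] when X is uniform on xs, eps is uniform on
    eps_values, X and eps are independent, and Y = s_star(X) + sigma(X)*eps."""
    total = Fraction(0)
    for x, e in product(xs, eps_values):
        y = s_star(x) + sigma(x) * e
        total += Fraction(y - t(x)) ** 2
    return total / (len(xs) * len(eps_values))

def squared_distance(t, s_star, xs):
    """Exact E[(s_star(X) - t(X))^2] for X uniform on xs."""
    return sum(Fraction(s_star(x) - t(x)) ** 2 for x in xs) / len(xs)

# Hypothetical toy model (an assumption made for this illustration only).
xs = [0, 1, 2]
eps_values = [-1, 1]
s_star = lambda x: x * x       # regression function
sigma = lambda x: 1 + x        # heteroscedastic noise level
t = lambda x: 2 * x            # candidate function

excess = risk(t, s_star, sigma, xs, eps_values) - risk(s_star, s_star, sigma, xs, eps_values)
```

In this toy model the excess risk of `t` coincides with `squared_distance(t, s_star, xs)`, whatever the noise level `sigma` is.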

Let $S$ be a linear space of functions and let $s_0$ be the orthogonal
projection of $s_\star$ onto $S$ in $L^2(P_X)$. Let
$B = \{t \in S,\ \|t\|_{L^2(P_X)} \le 1\}$, $\Psi = \sup_{t\in B} t^2$. We
assume the following moment assumptions. There exist finite $D$ and
$M_\Psi$ such that

(CM) $$E\big((Y - s_0(X))^2\,\Psi(X)\big) \le D, \qquad
   E(\Psi^2(X)) \le M_\Psi.$$

The following proposition ensures that Condition (CM) implies Condition
(CMarg).
Proposition 14. Let $(X, Y), ((X_i, Y_i))_{i=1,\ldots,n}$ be i.i.d. pairs of
random variables such that

$$Y = s_\star(X) + \sigma(X)\epsilon, \quad\text{where } E(\epsilon|X) = 0,\
  E(\epsilon^2|X) = 1,\ s_\star, \sigma \in L^2(P_X).$$

Let $S$ be a linear space of functions measurable with respect to
$L^2(P_X)$. Let $s_0$ be the orthogonal projection of $s_\star$ onto $S$ and
assume the moment condition (CM). Let $V \le n$ be an integer and let
$\mathcal B$ be a regular partition of $\{1, \ldots, n\}$. For all
$K = 1, \ldots, V$, let $\hat s_K = \arg\min_{t\in S} P_{B_K}\gamma(t)$.
Then, (CMarg) holds, with $\sigma_0^2 = 2M_\Psi$, $N = 1$, $\alpha_1 = 1/2$,
$\sigma_1^2 = 8D$.

Proposition 14 is proved in Section B.4. We can now derive the following 
consequence of Theorem 9. 

Proposition 15. Let $(X, Y), ((X_i, Y_i))_{i=1,\ldots,n}$ be i.i.d. pairs of
random variables such that

$$Y = s_\star(X) + \sigma(X)\epsilon, \quad\text{where } E(\epsilon|X) = 0,\
  E(\epsilon^2|X) = 1,\ s_\star, \sigma \in L^2(P_X).$$

Let $S$ be a linear space of functions measurable with respect to
$L^2(P_X)$. Let $s_0$ be the orthogonal projection of $s_\star$ onto $S$ and
assume the moment condition (CM). Let $\delta > 0$ such that
$\lceil\ln(\delta^{-2})\rceil \le n/2$. Let $\mathcal B$ be a regular
partition of $\{1, \ldots, n\}$, with
$V = \lceil\ln(\delta^{-2})\rceil \vee 8$. For all $K = 1, \ldots, V$, let
$\hat s_K = \arg\min_{t\in S} P_{B_K}\gamma(t)$ and let $\hat s_{K_\star}$ be
the associated estimator defined in (7). If $96\,e\,M_\Psi V \le n$, then,
for $L_7 = 384 + 128\sqrt{2e}\,L_1$,

$$P\Big(\ell(\hat s_{K_\star}, s_\star) \le \ell(s_0, s_\star)
   + L_7\,\frac{DV}{n}\Big) \ge 1 - 3\delta.$$

Proposition 15 is proved in Section B.5. 

6. Extension to mixing data. An interesting feature of our approach
is that it can easily be adapted to deal with mixing data, using coupling
methods as for example in [4, 16, 22, 23]. Let us recall the definition of
$\beta$-mixing and $\phi$-mixing coefficients, due respectively to [33] and
[21]. Let $(\Omega, \mathcal A, P)$ be a probability space and let
$\mathcal X$ and $\mathcal Y$ be two $\sigma$-algebras included in
$\mathcal A$. We define

$$\beta(\mathcal X, \mathcal Y) = \frac12 \sup \sum_{i=1}^{I}\sum_{j=1}^{J}
   \big|P\{A_i \cap B_j\} - P\{A_i\}P\{B_j\}\big|,$$

$$\phi(\mathcal X, \mathcal Y) = \sup_{A\in\mathcal X,\ P\{A\} > 0}\
   \sup_{B\in\mathcal Y}\ \big(P\{B|A\} - P\{B\}\big).$$
The first sup is taken among the finite partitions $(A_i)_{i=1,\ldots,I}$ and
$(B_j)_{j=1,\ldots,J}$ of $\Omega$ such that, for all $i = 1, \ldots, I$,
$A_i \in \mathcal X$ and for all $j = 1, \ldots, J$, $B_j \in \mathcal Y$.
For all stationary sequences of random variables $(X_n)_{n\in\mathbb Z}$
defined on $(\Omega, \mathcal A, P)$, let

$$\beta_k = \beta\big(\sigma(X_i, i \le 0), \sigma(X_i, i \ge k)\big), \qquad
  \phi_k = \phi\big(\sigma(X_i, i \le 0), \sigma(X_i, i \ge k)\big).$$

The process $(X_n)_{n\in\mathbb Z}$ is said to be $\beta$-mixing when
$\beta_k \to 0$ as $k \to \infty$; it is said to be $\phi$-mixing when
$\phi_k \to 0$ as $k \to \infty$. It is easily seen, see for example
inequality (1.11) in [10], that
$\beta(\mathcal X, \mathcal Y) \le \phi(\mathcal X, \mathcal Y)$, so that
$(X_n)_{n\in\mathbb Z}$ being $\phi$-mixing implies that
$(X_n)_{n\in\mathbb Z}$ is $\beta$-mixing.

6.1. Basic concentration inequality. In this section, we assume that
$n = 2Vq$. For all $K = 1, \ldots, 2V$, let
$B_K = \{(K-1)q + 1, \ldots, Kq\}$, $\mathcal B_{mix} = (B_1, \ldots, B_{2V})$.
For every $f : \mathbb R \to \mathbb R$, let
$\overline P_{mix} f = \overline P_{\mathcal B_{mix}} f$. Our first result is
the extension of Proposition 1 to mixing processes.
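As an illustration, the block construction of this section can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: indices are 0-based, the function names are ours, and we assume the lower median as the convention for the median of an even number of block means.

```python
import statistics

def consecutive_blocks(n, num_blocks):
    """Split the indices 0, ..., n-1 into num_blocks consecutive blocks of
    equal length q = n / num_blocks (the text uses num_blocks = 2V)."""
    if num_blocks <= 0 or n % num_blocks != 0:
        raise ValueError("n must be a positive multiple of num_blocks")
    q = n // num_blocks
    return [range(k * q, (k + 1) * q) for k in range(num_blocks)]

def median_of_block_means(values, num_blocks):
    """Median of the empirical means computed on consecutive blocks.
    The lower median is an assumption of this sketch."""
    blocks = consecutive_blocks(len(values), num_blocks)
    means = [sum(values[i] for i in block) / len(block) for block in blocks]
    return statistics.median_low(means)
```

For instance, with the eight observations `[1, 2, 3, 4, 100, 100, 5, 6]` and four blocks, the block means are 1.5, 3.5, 100 and 5.5, and the estimator returns 3.5: the block containing the two outliers does not drive the result.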

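Proposition 16. Let $X_{1:n}$ be a stationary, real-valued, $\beta$-mixing
process and assume that

$$C_\beta^2 = 2\sum_{l\ge 0}(l+1)\beta_l < \infty.$$

There exists an event $\Omega_{coup}$ satisfying
$P\{\Omega_{coup}\} \ge 1 - 2V\beta_q$ such that, for all $f$ such that
$\|f\|_{4,P} := (Pf^4)^{1/4} < \infty$, for all $V \ge \ln(2\delta^{-1})$, we
have, for $L_8 = 4\sqrt{6e}$,

$$P\Big(\Big\{\overline P_{mix} f - Pf > L_8\sqrt{C_\beta}\,\|f\|_{4,P}
   \sqrt{\frac{V}{n}}\Big\} \cap \Omega_{coup}\Big) \le \delta.$$

If moreover $X_{1:n}$ is $\phi$-mixing, with
$\sum_{q=0}^{n}\phi_q \le \Phi^2$, then, for all $f$ such that
$Pf^2 < \infty$, for all $V \ge \ln(2\delta^{-1})$,

$$P\Big(\Big\{\overline P_{mix} f - Pf > L_8\Phi
   \sqrt{\frac{V\,\mathrm{var}_P f}{n}}\Big\} \cap \Omega_{coup}\Big)
   \le \delta.$$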

Remark 15. This result allows us to extend the propositions of Sections 2
and 3. We only have to modify the estimation procedure slightly, choosing
$\overline P_{mix}$ instead of $\overline P_B$ to deal with these processes;
the price to pay is to work under higher moment conditions. Under these
stronger conditions, all the results remain valid.
Proof. For all $x > 0$ and all $V$, if the following bounds are satisfied,

$$\mathrm{Med}\{P_{B_K} f,\ K = 1, 3, \ldots, 2V-1\} - Pf \le x,$$
$$\mathrm{Med}\{P_{B_K} f,\ K = 2, 4, \ldots, 2V\} - Pf \le x,$$

then there are at least $V$ values of $P_{B_K} f$ smaller than $Pf + x$, so
that $\overline P_{mix} f - Pf \le x$. Hence

$$P\{\overline P_{mix} f - Pf > x\} \le
   P\{\mathrm{Med}\{P_{B_K} f,\ K = 1, 3, \ldots, 2V-1\} - Pf > x\}$$
$$\qquad + P\{\mathrm{Med}\{P_{B_K} f,\ K = 2, 4, \ldots, 2V\} - Pf > x\}.$$

The proof is then a consequence of Lemma 17 below. □
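The counting argument of this proof can be checked numerically. The sketch below assumes that the median of an even number of values is the lower median, under which the argument is exact: if the medians over odd-indexed and even-indexed blocks are both at most $c$, then at least $V$ of the $2V$ values are at most $c$, hence so is the overall median. The function name is ours.

```python
import random
import statistics

def split_median_dominates(block_means):
    """Check that the lower median of all 2V block means is at most the
    larger of the two lower medians computed on the odd-indexed and on the
    even-indexed blocks (blocks are numbered 1, ..., 2V as in the text)."""
    if len(block_means) == 0 or len(block_means) % 2 != 0:
        raise ValueError("expected a non-empty list of even length 2V")
    odd = block_means[0::2]    # K = 1, 3, ..., 2V - 1
    even = block_means[1::2]   # K = 2, 4, ..., 2V
    overall = statistics.median_low(block_means)
    return overall <= max(statistics.median_low(odd), statistics.median_low(even))

# Randomised sanity check on arbitrary block means.
rng = random.Random(1)
checks = [
    split_median_dominates([rng.gauss(0.0, 1.0) for _ in range(2 * rng.randint(1, 9))])
    for _ in range(2000)
]
```

Every entry of `checks` is `True`, in line with the deterministic argument above.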

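Lemma 17. Let $X_{1:n}$ be a stationary, real-valued, $\beta$-mixing process
and assume that

$$C_\beta^2 = 2\sum_{l\ge 0}(l+1)\beta_l < \infty.$$

For all $a \in \{0, 1\}$, there exists an event $\Omega^a_{coup}$ satisfying
$P\{\Omega^a_{coup}\} \ge 1 - V\beta_q$ such that, for all $f$ such that
$\|f\|_{4,P} := (Pf^4)^{1/4} < \infty$, for all $V \ge \ln(2\delta^{-1})$,
we have $P\big((\Omega^a_f)^c \cap \Omega^a_{coup}\big) \le \delta/2$, where
$\Omega^a_f$ is the set

$$\mathrm{Med}\{P_{B_K} f,\ K = 1+a, 3+a, \ldots, 2V-1+a\} - Pf \le
   L_8\sqrt{C_\beta}\,\|f\|_{4,P}\sqrt{\frac{V}{n}}.$$

If moreover $X_{1:n}$ is $\phi$-mixing, with
$\sum_{q=0}^{n}\phi_q \le \Phi^2$, then, for all $f$ such that
$Pf^2 < \infty$, for all $V \ge \ln(2\delta^{-1})$, we have
$P\big((\Omega^a_f)^c \cap \Omega^a_{coup}\big) \le \delta/2$, where
$\Omega^a_f$ is the set

$$\mathrm{Med}\{P_{B_K} f,\ K = 1+a, 3+a, \ldots, 2V-1+a\} - Pf \le
   L_8\Phi\sqrt{\frac{V\,\mathrm{var}_P f}{n}}.$$

Lemma 17 is proved in Section C.1.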

6.2. Construction of Estimators. The purpose of this section is to adapt
the results of Section 5 to our mixing setting. Let $V \ge 8$ and assume
that $n$ can be divided by $2V$. Let us write $q = n/(2V)$ and, for all
$K = 1, \ldots, 2V$, $B_K = \{(K-1)q + 1, \ldots, Kq\}$. Let us then denote
by $I = \cup_{K=1}^{V} B_{2K-1} \subset \{1, \ldots, n\}$. We define now, for
all $K \ne K' = 1, \ldots, V$,
Let us assume, as in Section 5, that, for all $K = 1, \ldots, V$,
$\hat s_K = F(X_{B_{2K-1}})$. Our final estimator is defined now by
$\hat s^{mix} = \hat s_{K_\star^{mix}}$, where

(10) $$K_\star^{mix} = \arg\min_{K=1,\ldots,V}\ \max_{K'=1,\ldots,V}
   \big\{\overline P^{mix}_{K,K'}\big(\gamma(\hat s_K) - \gamma(\hat s_{K'})\big)\big\}.$$
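The min-max rule in (10) can be sketched generically as follows. This is a minimal sketch under assumptions of ours: `pair_statistic(K, K')` is a hypothetical callable standing for the robust estimate of $P(\gamma(\hat s_K) - \gamma(\hat s_{K'}))$ computed on the blocks not used by the two candidates, the maximum is taken over $K' \ne K$, and indices are 0-based.

```python
def select_minmax(num_candidates, pair_statistic):
    """Return the index K minimising max over K' != K of pair_statistic(K, K').

    pair_statistic(k, k_prime) is assumed to estimate the difference of risks
    between candidate k and candidate k_prime (negative values favour k).
    Ties are broken in favour of the smallest index.
    """
    if num_candidates < 2:
        raise ValueError("at least two candidates are needed")
    best_index, best_value = None, float("inf")
    for k in range(num_candidates):
        worst = max(
            pair_statistic(k, k_prime)
            for k_prime in range(num_candidates)
            if k_prime != k
        )
        if worst < best_value:
            best_index, best_value = k, worst
    return best_index

# Toy usage: if the pairwise statistic were the exact difference of risks,
# the rule would return the candidate with the smallest risk.
toy_risks = [3.0, 1.0, 2.0]
chosen = select_minmax(len(toy_risks), lambda k, kp: toy_risks[k] - toy_risks[kp])
```

In the toy usage, `chosen` is the index of the smallest entry of `toy_risks`. With noisy pairwise statistics the selected index may differ, which is what the oracle inequality below quantifies.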

Our result is the following. 

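Theorem 18. Let $X_{1:n}$ be $\phi$-mixing random variables with
$\sum_{q=1}^{n}\phi_q \le \Phi^2$. Let $\delta > 0$ such that
$\lceil\ln(2\delta^{-2})\rceil \le n/2$ and let
$V = \lceil\ln(2\delta^{-2})\rceil \vee 16$. Let
$(\hat s_K)_{K=1,\ldots,V}$ be a sequence of estimators such that
$\hat s_K = F_K(X_{B_{2K-1}})$ and assume that (CMarg) holds. Let
$\hat s^{mix}$ be the associated estimator defined in (10). Let
$C_0 = L_8\Phi$, and, $\forall i = 1, \ldots, N$,
$C_i = (1 - \alpha_i)\big(C_0(\alpha_i)^{\alpha_i}\big)^{1/(1-\alpha_i)}$.
For all $A \ge 1$, let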



1 

1 — a,- 



i=l 

For all A > 1, we have, with probability larger than 1 — 5 — e — V(3, 



1' 



(l-M„(A))^(F"^So)<(l + 3z.„(A))^ inf U{s2K-USo)} + Rn{A) . 

-TV =1,..., V 

Ln particular, for all A, n such that fn(A) < 1/2, we have, with probability 
larger than 1 — e — 5 — V/Sg, 

^(S--,s,) <£(s„,s,,) + (l + 8M„(A))^jnf ^{e{s2K-i,So)} + 2Rn{A). 

Remark 16. The only price to pay to work with these data is therefore
the term $V\beta_q$ in the control of the probability and a slight change of the
constants. When 6,€ = 0(n~^) and f3q < Cq~^^^^^ for 6* > 1, we obtain that 
this probability is 0(n~^) in both cases, the price is negligible. 

6.3. Application to density estimation. Proposition 10 ensures that
Condition (CMarg) holds for all stationary processes and all estimators. In
order to extend Proposition 11 to mixing data, we only have to extend the
inequality on $E(\|\hat s_K - s_0\|^2)$. The result is the following.

Proposition 19. Let $X_{1:n}$ be a $\phi$-mixing process such that
$\sum_{q}\phi_q \le \Phi^2$, with common marginal density $s_\star$. Assume
that $s_\star \in L^2(\mu)$. Let $\delta > 0$ such that
$\lceil\ln(\delta^{-2})\rceil \le n/2$ and let
$V = \lceil\ln(\delta^{-2})\rceil \vee 16$. Let $B_1, \ldots, B_{2V}$ be a
regular partition of $\{1, \ldots, n\}$. For all $K$, let
$\hat s_K = \arg\min_{t\in S} P_{B_K}\gamma(t)$ and let $\hat s^{mix}$ be the
associated estimator defined in (10). We have

Proof. As, on ^good the estimators are equal to those built using the 
independent data {Ak)k=i,...,v, we only have to prove, thanks to Theorem 
18 and Lemma 26 that 

E - Sof ) < 

to obtain the result. We have, from inequality (23), 

□ 
APPENDIX A: PROOFS OF THE BASIC RESULTS
A.1. Proof of Proposition 1. We have

$$\overline P_B f - Pf = \mathrm{Med}\{P_{B_K} f - Pf,\ K = 1, \ldots, V\}.$$

Denote, for all $x > 0$, by
$N_x = \mathrm{Card}\{K = 1, \ldots, V \text{ s.t. } P_{B_K} f - Pf > x\}$.
For all $V \le n$, we have

$$P\{\overline P_B f - Pf > x\} \le P\Big\{N_x \ge \frac{V}{2}\Big\}.$$

Now let r > \/2 to be chosen later. It comes from Tchebychev's inequality 
that 
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Let us also introduce a random variable B with binomial distribution B(y, r). 
It comes from Lemma 23 that 



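Now, it comes from equation (2.8) in Chapter 2 of [26] that

$$P\Big\{B \ge \frac{V}{2}\Big\} \le
   e^{-\frac{V}{2}\ln\left(\frac{1}{4r(1-r)}\right)}.$$

We choose $r = (1 - \sqrt{1 - e^{-2}})/2$, so that $4r(1-r) = e^{-2}$ and
the right-hand side above equals $e^{-V}$, which is at most $\delta$ as soon
as $V \ge \ln(\delta^{-1})$. For all $V \ge \ln(\delta^{-1})$,
we have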

(12) P I Pi,/ - P/ > I 

Finally, we have r > l/(12e). 
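The numerical facts used in this proof are easy to check. The sketch below, with function names of our own choosing, verifies that the chosen $r$ satisfies $4r(1-r) = e^{-2}$ and $r \ge 1/(12e)$, and compares the exact binomial tail $P\{B \ge V/2\}$ with the exponential bound $e^{-V}$.

```python
import math

# Value of r chosen in the proof of Proposition 1.
r = (1 - math.sqrt(1 - math.exp(-2))) / 2

def binomial_upper_tail(V, p, threshold):
    """Exact P(B >= threshold) for B with binomial distribution B(V, p)."""
    start = max(0, math.ceil(threshold))
    return sum(
        math.comb(V, k) * p**k * (1 - p) ** (V - k)
        for k in range(start, V + 1)
    )

def exponential_bound(V, p):
    """exp(-(V/2) * ln(1 / (4 p (1 - p)))), the bound on P(B >= V/2)."""
    return math.exp(-0.5 * V * math.log(1 / (4 * p * (1 - p))))
```

With this choice of `r`, `exponential_bound(V, r)` equals $e^{-V}$ up to rounding, so the bound is at most $\delta$ whenever $V \ge \ln(\delta^{-1})$.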

A. 2. Proof of Proposition 6. Let us first remark that, for all 9 & Q, 
we have 

crit(^,i3) = \\sef -2^exPBi^x + 2Y^ 
AeA AeA 

= \\sef - 2 J2 ^aPV'a + 2 J2 ^^^P - ^6)V'A + 2 5^ I^aI 
AeA AeA AeA 

= \\se - - + 2 Gx{P - PB)i^x + 2 c^A |^a| ■ 

AeA AeA 

By definition of 0, we have therefore, for all G 0, 

11^^- <\\se- s^f + 2Y,{0x- 0x){P - Pb)^x 

AeA 

+ 2^wa|^a| -2 J^t^A 

AeA AeA 

Let ^good be the event 

V 



VAgA, |P^a-P^a1 <i2V^BV'AY- 

Since V > ln(4M5~^) and (C(2?)) holds, it comes from Corollary 3 for the 
dictionary V and the uniform probability measure on A that "^{^good} ^ 
1 — 5. Moreover, on ^good 

VAG A, 2|(P-Pb)^a| <^x . 
On $\Omega_{good}$, for all $\theta \in \Theta$, by the triangular inequality, we have then
\\Sg- S^\f + \0x - Ox 



aga 



< Wse-s^"^ + 2^ujx\9x-ex + 2 |^a| - 2 I^a 



aga 

< \\so - s^f + 2 ^A 

AgJ(6») 

< i|s6i - +4 ^A 

AgJ(6») 



AgA AgA 

0x-0x\+2 ^aI^aI-2 Y1 

AgJ(6I) AgJ(0) 

^a-^a' 



A. 3. Proof of Theorem 7. Let us first remark that, for all nio G M, 

= min \ \\sg^rn - S*f - 2(P„- P){se,m - Sm - Smo + Sm) 



+a \\s0 . 



+ pen 



(m)| 



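For all $m \in \mathcal M$, we denote by $(\psi_\lambda)_{\lambda\in\Lambda_m}$
an orthonormal basis of $S_m$ and by
$\beta_\lambda = \langle s_{\theta,m}, \psi_\lambda\rangle$. Using
Cauchy-Schwarz inequality and the inequality
$2ab \le \epsilon a^2 + \epsilon^{-1}b^2$, for all $\theta \in \Theta$, for
all $m \in \mathcal M_\theta$, we obtain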



2\iPn-P)ise,m-Sm)\=2 



Wi-Pi^x)[{Pn-P)^X 



aga„ 



<2 /^(^^-pVa)^ / [{P^-P)i^,y 



xeA„ 



agA„ 



aga„ 



AgA„ 



,11, ||2 , EAGAj(^«-mA] 

t II *6,m *m II ~r 



Let now O,good be the event 



YiiPn- P)^xf < (1 + Lo^)^ + ^rm{6) + 

AGAm 
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It comes from Lemma 24 in the appendix that, for all z/ > and all 6 > 6o, 
F{i^good} >l — 6. Moreover, on Qgood, for all e > 0, we have 



We 



c 

choose e = 1/2 and n > Hq. Since 

^^^/^^^or \^^™^2LJ^ m^2L„2,., 
pen(m) > - + 2Loi^ + r„ 5) + -^r^[6) , 

on r^goodi using the triangular inequality, we have then 
crit„(0) + lls^f + 2(P„ - 

< min \^\\s0,m- s^,\f + a\\sgm-se\f + '2pen{m)\ 

m£Mg { Z J 

< 3 - s^ll^ + min M3 + a) „ - + 2 pen(m) I , 

rneMe ^ J 

crit„(0) + ||s^f + 2(P„-P)s„„ 

> min I ^ ||se,m - + a p6i,m - sell^ I 

mGAle 1^ 2 J 
By definition of ^, it follows that, on il.good, V0 G 

(iAf)ii%-..r 

< 3 pe - s*||^ + min M3 + a) pe m - ^ell^ + 2 pen(m) ^ . 

A. 4. Proof of Theorem 8. The proof follows essentially the same 
steps as the one of Theorem 7. Let mo be a minimizer of 2e ||s* — + 
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4e(en) ^ I]agA,„ ^arp ipxVx- We have 
crit„(0) + ||s.f + 2 Pi;x(PB,-P)i^^ 



aga„ 



min i pe,m - - 2 V (^^ - PV'a)(/^b,V'a - ^'^a) 



AgA„ 



+2 ^ aA(Pi3A --P)^A + ap0,m. -sef + pen(m) 



where, for all A G AmUA^^ , = Pi)\ ( 1agA™„ - IasA™ ) , so that EAeA„,uA™„ a 

1 1 2 

ll^m — Smoll • Using Cauchy-Schwarz inequality and the inequality 2a6 < 
ea} + e~^6^, for all G 0, for all m G A^g, we obtain 



2 5^ I^^-PVa 



aga„ 



\PbA\-P'^\\ 



(13) <2 /E(^A-^^A)y E [(Pb.^a-PVa)]' 
< 2; (^^ - PVa)' + i E [(^^5,Va - P^a)]' 

AGA™ AGAm. 

(14) =r]\\se,m-smf ^- [(Pb.^a-PV'a)]' . 



aga„ 



2 J] \ax(PB^-P)^> 



AGAmUA. 



m 'J >-mo 



V AGAmUAm„ V AGAmUA,„„ 



(15) 



AGAmUA 

mo 



<2e||s^-s™„f + 2e||s^-s^f + M E [(^Ba-^)V'a 

\ AGAmUA 

Let now Ogood be the event 

VA G Aa/, IPb.V'a - ^'V'aI < Li^varpCV^A)^^ ■ 
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Using a union bound in Proposition 1, since Vx > ln{2{ir{X)6) we have, 
^{^good} > 1 — (5/2. Moreover, on ilgood, from (14). for all r] > 0, we have 



2 



2 I^A-^^A \PB,^X-Pi'x\<V\\s9,ra-Smf + ^ V varp(^A)V^A- 

agA™, AeAn 
From (15), for all e > 0, on ^good-, 

AGAmUA 

'2 



< 2e - SmJI^ + 2e - s*||^ + — ^ varp('0A)VA 

\ AGAmUA 

^ 26 lls^ '5m„ll ~l~ 26 llSm S-^ll 



j2 

+ — V varp(^A)^A + V varp(V'A)'t^A 

AgA,„ AGAmo 



< 4e \\sm - s*f + — varp(V'A)Kx • 

The last inequality is due to the definition of nio- We choose rj = 4e, on 
^good, using the triangular inequality, we deduced that, for L4 = L2+LI/A = 
9Lf /4, crit„(6') + + 2 Eaga„„ -PV'a(?BaV'a - -PV'a) is smaller than the 
infimum over m £ Aig 

2 ^^^4 ^ — ^ 2 

(1 + 4e) ||s9,m - H > varp(V'A)V\ + a p6i,m - sell + pen(m) 

716 

AGAm 

2 2 ^^^4 ^ — ^ 

<4||se-Sv,|| + (4 + a) p6).m - sell H > varp(V'A)Vx + pen(m) . 

ne 

agA,„ 

On the other hand, critQ,(6') + ||sv,||^ + 2 Y1i\<^k Pi^x{PBx'^\~ P'^x) is larger 
than the infimum over m £ Mg oi 

(1 - 46) \\sg,m - s^\f - — var:p{ipx)Vx + a \\sg^rn. - sell^ + pen(m) 

ne 

agA„i 

> J((l-46) AQ)||se-s^f -— V varp('0A)^A+pen(m) . 
2 ne ^ 

agA™, 






Since ^ 

pen(m) > — vaTp{iJx)Vx , 

by definition of 0, we deduce that, on ^good, for all G 0, 

-((l-4e) Aq)||%-s^|| 

< 4 ||s6i - + min \ {A + a) \\sg^rn - s'elf + '^pen{m)\ 

APPENDIX B: PROOFS FOR M-ESTIMATION
B.l. Proof of Theorem 9. Let ^(^c) be the event 



N 

max llvar [{j{sk) - 7(^0) ) {X) | ] ||^ < a^eisK, Sof+Y. <rfe{sK, Sof"^ 



1=1 



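We apply Proposition 1 to $f = (\gamma(\hat s_K) - \gamma(s_0))1_{\Omega(C)}$
and to $-f$, conditionally to the random variables $X_{B_K}$ and to the
partition $(B_J)_{J \ne K, K'}$ of $\{1, \ldots, n\} \setminus (B_K \cup B_{K'})$,
with cardinality $n - |B_K| - |B_{K'}| \ge n(1 - 2/V - 2/n) \ge n/2$ since
$V \ge 8$. We have

$$\mathrm{var}(f(X) \mid X_{B_K}) \le \sigma_0^2\,\ell(\hat s_K, s_0)^2
   + \sum_{i=1}^{N}\sigma_i^2\,\ell(\hat s_K, s_0)^{2\alpha_i}.$$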

Since $V \ge \ln(\delta^{-2})$, Proposition 1 gives that, with probability
larger than $1 - 2\delta^2$, conditionally to $X_{B_K}$, the following event
holds
> L 



^ fv 

\ i=l 



D f](C)n \iPK,K' - P) hisK) - 7(So))| 



As the bound on the probability does not depend on $X_{B_K}$, the same bound
holds unconditionally. We use repeatedly the classical inequality
$a^\alpha b^{1-\alpha} \le \alpha a + (1-\alpha)b$. Let
$r_n = \sqrt{n^{-1}V}$, let $C_0 = L_1$ and, for all $i = 1, \ldots, N$, let
$C_i = (1 - \alpha_i)\big(L_1(\alpha_i)^{\alpha_i}\big)^{1/(1-\alpha_i)}$. We
obtain that the following event has probability smaller than $2\delta^2$,

n^cP I (Pk,K' - P) hisK) - 7(So) )| 

/ iV\ ^ ^ 

< ( Coaorn + ^ iisK, So) + Q ( A"'a,r„ ) . 

^ ^ i=i 

Using a union bound, we get that the following event has probability
larger than $1 - V(V-1)\delta^2 - \epsilon$: $\forall K, K' = 1, \ldots, V$,

\(Pk,k'-P){i{sk)-7{sk'))\ 

< ( Coaorn + ^ j {iisK, So) + iisK',So) ) + 2 2^ Ci ( A"'air„ ) . 
Given the value of y, we obtain that 

y(y-l)52 = ([ln(r2)J (^1^(^-2)J _i))^2< ((i^(^-2)^-^)j^(^-2))^2 

<5{ sup {5((ln(r2) + l)ln(r2))} ) < 5. 
V^e(o,i) J 

Hereafter, we denote by 

N ^ ^ 



i=l 



and by rigood(^) the event, 

K,K'=1,...,V 

< l^niA) {i{sK, So) + iisK',So) ) + -Rn(A)} . 

Let Ko be the index such that P^{sk) is minimal. For all n, A sufficiently 
large so that i^n(A) < 1, on ^good-, we have 

^{sK.^So) = {P {l{^K^ - l{^Ko))) + ^{^Ko.So) 

= [P (7(sirJ - l{sKo) ) - ^n{^) {^{sK.,So) + So) )] 

+ l/„(A) (^(Si^,, So)) + {l + Z^n(A) ) ^{sK,,So) 

= sup [P(7(s_kJ - 7(si^)) - fn(A) (^(sK^,So) +t{sK,So))] 
K 

+ l/„(A) (^(Si^,, So) ) + ( 1 + Z/n(A) ) ^(Si^„, So) 

< sup [-Pe',,/^ (7(sxJ - 7(5^-))] 

A' 

+ l/„(A) (^(SK,, So) ) + ( 1 + Z^„(A) ) So) + Pn(A). 
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By definition of K^,, we have 

{1- l^niA))iisK.,So) 

< sup [PKo,K (7(sxJ - lisK))] + (1 + M^) ) KsKo,So) + Rn{A) 

K 

< sup [P (7(S/<J - ^(sk) ) + l^n(A) ii{sKo, So) + ^SR, So))] 

K 

+ ( 1 + i/„(A) ) £{sK,,So) + 2ii„(A) 
= sup [(1 + z^„(A) ) So) - ( 1 - z/„(A) ) So)] 

+ ( 1 + ( A) ) £(sx„ , So) + 2i2„ ( A) 
= (l + 3i/„(A))^(si^„,So) +2i?„(A) . 

This concludes the proof of Theorem 9. 
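The way the inequality $a^\alpha b^{1-\alpha} \le \alpha a + (1-\alpha)b$ produces the constants $C_i$ in the proof above can be illustrated numerically. The sketch below is ours: it takes $a = \ell/(\alpha A)$ and $b = (L_1\alpha^\alpha A^\alpha\sigma r)^{1/(1-\alpha)}$, which gives $L_1\sigma r\,\ell^\alpha \le \ell/A + C\,(A^\alpha\sigma r)^{1/(1-\alpha)}$ with $C = (1-\alpha)(L_1\alpha^\alpha)^{1/(1-\alpha)}$. The parameter values drawn below are arbitrary.

```python
import random

def young_split(ell, sigma, r, alpha, A, L1):
    """Return (lhs, rhs) for the inequality
    L1 * sigma * r * ell**alpha <= ell / A + C * (A**alpha * sigma * r)**(1 / (1 - alpha)),
    where C = (1 - alpha) * (L1 * alpha**alpha)**(1 / (1 - alpha))."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie in (0, 1)")
    C = (1 - alpha) * (L1 * alpha**alpha) ** (1 / (1 - alpha))
    lhs = L1 * sigma * r * ell**alpha
    rhs = ell / A + C * (A**alpha * sigma * r) ** (1 / (1 - alpha))
    return lhs, rhs

# Randomised check over arbitrary positive parameters.
rng = random.Random(0)
pairs = [
    young_split(
        ell=rng.uniform(0.01, 10.0),
        sigma=rng.uniform(0.1, 5.0),
        r=rng.uniform(0.01, 1.0),
        alpha=rng.uniform(0.05, 0.9),
        A=rng.uniform(1.0, 10.0),
        L1=rng.uniform(0.5, 10.0),
    )
    for _ in range(2000)
]
```

For $\alpha = 1/2$ the constant is $C = L_1^2/4$, which is the value behind the terms of order $L_1^2 A\sigma_1^2 V/n$ in the applications.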

B.2. Proof of Proposition 12. We have 

So _ (1 + x)Plj, ^ 

We deduce that 

We have, by Cauchy-Schwarz inequahty, 

(■(^K,So) = Soln { ^ ] = So In ( ^ ) - / So In ( — ) 

J \SK J Jso>SK \SK J Jso<SK \ So J 

(16) >[ Soln(^)-[ .oflnf^^y 

J So>sk \ J J So<SK V \ So J J 

Let us now recall the following lemma, see for example [26] Lemma 7.24. 
Lemma 20. For all probability measures P, Q with P « Q, 

l/.P...,(.(g))%/...(|) 



Using Lemma 20, we get 
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Plugging this inequality in (16), we obtain 



solni ^ <3 / Solnl ^ ). 

So>SK \^kJ J \SK 

Finally, we obtain 

var (7(5^ - 7(so) | Xsj^ ) < / So ( In 



So 

o<SK V V SK J J Jso>SK \ \ 



So 
SK 



<2 /solnf ^ ) +ln(x-^ + l) / Soln 

J \SK / Jso>SK 

< {2 + 3ln{x-^ + l))e{sK, So). 
Hence, (CMarg) holds with $\sigma_0 = 0$, $N = 1$, $\alpha_1 = 1/2$,

$$\sigma_1^2 = 2 + 3\ln(1 + x^{-1}).$$

B.3. Proof of Proposition 13. Condition (CMarg) holds from
Proposition 12. Hence, Theorem 9 ensures that, for all $A \ge 2$, the
estimator (7) satisfies, with probability larger than $1 - \delta$,

$$\ell(\hat s_{K_\star}, s_\star) \le \ell(s_0, s_\star)
   + \Big(1 + \frac{8}{A}\Big)\inf_{K}\{\ell(\hat s_K, s_0)\}
   + L_1^2 A\,\big(2 + 3\ln(1 + x^{-1})\big)\frac{V}{n}.$$

Let us now fix some K in l,...,y and let us denote, for all A G A by 
ax = P{h), ax = We have 

E(.(..,.)).K(/(|:-;|^ln(|))..) 
A6W0 V V"^^^ 

Now, we have 

Inf^l =-lnfl-^^V with^^ = l-^<l ^ 



a\) V «A / ax ax {l + x) ' 

We use the following inequalities, for all n < 1 — 

-in(i -u)-u ^ r2in(r) ^ r 



w2 - (F - 1)2 ' r - 1 ■ 
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In 



ax 



< 



ax - ax 



ax J ax 
Moreover, we have, 



+ (l + x)^ln 



1+x 



+ 1 + x 



ax - ax 
ax 



0<2;< 1, andE(aA-aA) < 0, E {ax - ax) ) < + x 



ax 



\Bk\ 



We deduce that, for all A G A such that qa / 0, 

ax - ax 



axE ( In ( ^ 



axEl - In 1 



< E(aA - Sa) + 



ax 

41n(2x"^) + 2 



< (41n(2x-^) + 2) 



OA 

1 



E((aA-aA)') 



+ 



\Bk\ ax 



< (41n(2x-i) + 2) l^^^+x^CregiS)^ . 



We deduce that 



Hence, applying Lemma 26 with a = 1 and the previous bound in expecta- 
tion, we obtain 



K= 



_inf ^£{sK,So) > Ve {4:\n{2x-^) + 2) D + x'^CregiS)^ | < 5. 

This concludes the proof, choosing A = 2 and x = n~^. 

B.4. Proof of Proposition 14. We have 
var {j{sk) - 7(so) I ) 



var (sKiX) - SoiX) f + 2{so{X) - sk{X) ){Y- So{X) ] 



Xb^ 



+ 8 var ( ( So{X) - sk{X) ) {Y - So{X) ) \ Xb^ \ 



Xbi, 



<2var 



<2E {isKiX)-So{X)y 



Xbt, 



+ 8E {so{X) - sk{X) f{Y- So{X) f 
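Let us now consider an orthonormal basis (in $L^2(P_X)$)
$(\psi_\lambda)_{\lambda\in\Lambda}$ of $S$ and let us write

$$s_0 = \sum_{\lambda\in\Lambda} a_\lambda\psi_\lambda, \quad
  \hat s_K = \sum_{\lambda\in\Lambda} a_\lambda(K)\psi_\lambda, \quad
  \text{hence } P(s_0 - \hat s_K)^2 = \sum_{\lambda\in\Lambda}
  (a_\lambda - a_\lambda(K))^2.$$

From Cauchy-Schwarz inequality, $\Psi = \sum_{\lambda\in\Lambda}\psi_\lambda^2$;
hence, from Cauchy-Schwarz inequality, we have

$$(\hat s_K(X) - s_0(X))^2
  = \Big(\sum_{\lambda\in\Lambda}(a_\lambda - a_\lambda(K))\psi_\lambda(X)\Big)^2
  \le \Big(\sum_{\lambda\in\Lambda}(a_\lambda - a_\lambda(K))^2\Big)
      \Big(\sum_{\lambda\in\Lambda}\psi_\lambda^2(X)\Big)
  = \ell(\hat s_K, s_0)\,\Psi(X).$$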



Thus, we have 

Moreover, from Cauchy-Schwarz inequality,

$$E\big((s_0(X) - \hat s_K(X))^2\,(Y - s_0(X))^2 \mid X_{B_K}\big)
   \le \ell(\hat s_K, s_0)\,D.$$

We deduce that (CMarg) holds, with $\sigma_0^2 = 2M_\Psi$, $N = 1$,
$\alpha_1 = 1/2$, $\sigma_1^2 = 8D$.



B.5. Proof of Proposition 15. Theorem 9 and lAAM^V < n ensure 

i < i 

A - 2' 



that, for all A > 4, we have i^n(A) = 6a/ MmY- _|_ i < 1 hence, the estimator 



(7) satisfies, with probability larger than 1 — 5, 
(17) 1{SK^ ,s,)< e{so, + ( 1 + 8m„ ( A) ) inf { e(sK , s.) } + SL? A 



K n 



In order to control $\inf_K\{\ell(\hat s_K, s_0)\}$, we use the following
method, due to Saumard [30]. We fix $K$ in $1, \ldots, V$ and, for all
constants $C$, we denote by

$$\mathcal G_C = \{t \in S,\ \ell(t, s_0) \le C\}, \qquad
  \mathcal G_{>C} = \{t \in S,\ \ell(t, s_0) > C\}.$$
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We have, by definition of 'sk, 

e{sK, So)>C^ inf PsAlit) - l{so)) > ^ inf PBAli^) " 7(^o)) 

sup PsjAliso) - lit)) < sup Psjiiliso) - j{t)) 
teGc teG>c 

^ sup { (Pb^ - P)(7(so) - lit)) - i{t, So) } 
teGc 

< sup {{Ps^ - P){j{so) - lit)) - e{t,so)} 

teg>c 

sup {{PB^-P){l{So)-l{t))-£{t,So)}>0. 

t&Q>c 
Let us now write

$$\gamma(s_0) - \gamma(t) = (Y - s_0(X))^2 - (Y - t(X))^2
  = -2(Y - s_0(X))(s_0(X) - t(X)) - (s_0(X) - t(X))^2.$$

Given an orthonormal basis (in $L^2(P_X)$)
$(\psi_\lambda)_{\lambda\in\Lambda}$ of $S$, we write

$$s_0(X) - t(X) = \sum_{\lambda\in\Lambda} a_\lambda\psi_\lambda.$$

Hence, we liave, for all t in S, for all e > 0, using Cauchy-Scliwarz inequality, 

(Ps^-P){j{So)-l{t)) 

= -2Y,o.x{Pb^-P){{Y -So)iJx)- «AaA'(^BK-^)(V'A^A') 
aga a,a'ga 

< 2 IYA. In ( -P){{Y-So)M)' 

y AgA y AGA 

+ E"l/ E {{Pb,- - p){^^xM)' 

AgA Y a.a'gA 
AgA 

+ ^ Y. UPb,- - P)ii^xM)'^ Kt, so). 
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We have obtained that if Sq) > C then 
sup {- Y,{{P^^^-P)({Y-s,)^|;,)f 



1 - e - ^ |] {{Pb^- P)(Va^A'))' j ^{t,so) I > 0. 

Let = {T.x,x'&k{{Pbk - P){^Mf < 1/4} and take e = 1/4, we 
deduce from the previous bound that, on H.^, 

£{SK, So) > C ^ 16 J] ( {Pb^ -P)i(Y- So)Va )f>C. 
AeA 

This means that £(s_k, So)!^^' < 16 XIaga ( (-^^k " -P) ( " So)^A ) in 
particular, 

w(fr ^^ ,.,, EAeAvar((y-.o)^A) ^^ /((^-^°)^-^(-^)) 

l-DA'l I^A'I 

D DV 

(18) = 16-— < 32 . 

I-DA-I n 

We have 



a,a'ga 



Hence, if r = (1 — Vl — e ^)/2, by Markov property. 



((p.,--p)(^A^v)f >^)<^ 

A,A'eA y 
As r > l/(12e) and 24e^^ < i, we deduce 
(19) P(J)^)<r. 
From Lemma 27 and equations (18), (19), we deduce that 

Pf inf i( SK, So) > I28e'^—] <2e-^ . 
\K=i,...,v n J 

Together with (17), this yields, for A = y/28eL^^ , 

P I i{sK, , > ( 384 + 128\/2eLi ) ^ | < + 25^ 
APPENDIX C: PROOFS IN THE MIXING-CASE

C.1. Proof of Lemma 17. Let $a \in \{0, 1\}$ and let us recall the
following lemma due to Viennet [32].

Lemma 21. (Lemma 5.1 in [32]) Assume that the process $(X_1, \ldots, X_n)$
is $\beta$-mixing. There exist random variables $(A_K)_{K=1,\ldots,V}$ such
that:

1. $\forall K = 1, \ldots, V$, $A_K$ has the same law as
$(X_i)_{i\in B_{2K-1+a}}$,

2. $\forall K = 1, \ldots, V$, $A_K$ is independent of
$(X_i)_{i\in B_{1+a}}, \ldots, (X_i)_{i\in B_{2(K-1)-1+a}}$,
$A_1, \ldots, A_{K-1}$,

3. $\forall K = 1, \ldots, V$,
$P\big((X_i)_{i\in B_{2K-1+a}} \ne A_K\big) \le \beta_q$.

Let us define the event

$$\Omega_{coup} = \{\forall K = 1, \ldots, V,\ (X_i)_{i\in B_{2K-1+a}} = A_K\}.$$

It comes from Viennet's Lemma that $P\{\Omega_{coup}\} \ge 1 - V\beta_q$.
For all $K = 1, \ldots, V$, let $A_K = (Y_i)_{i\in B_K}$ and for all
measurable $t$, let

On ^coup, we have PB^K-i+af = ^AkI, hence, 
(20) 

p{x) :=P{Med{PB,,/, ir = l,3,...,2y-l + a}-P/>xnO^„„p} 



<P|Card{K = l,3,...,2y-l + a, s.t. Pb^I - Pf > x} > - n 

<p|card{K = l,...,y, s.t. PA^f-Pf>x}>^ 
By Markov inequality, for all r, for all if = 1, . . . , y. 



^^^,_^,,./var(P..,/-P/) 



v^riPB^f-Pf) 



PB^f-Pf>\ ^ }<r 



From Lemma 23, we deduce, denoting by Xr = '^^^^^^'^J and by 
B(y,r) a binomial random variable with parameters V and r, that 

(21) p(x,.) < p|s(y,r) > ^1 < e^'''(^^^ 
We choose $r = (1 - \sqrt{1 - e^{-2}})/2$ so that
$\ln\big(\frac{1}{4r(1-r)}\big) = 2$. Let us recall now the following
inequality for mixing processes (see for example [16] inequalities
(6.2) and (6.3) or Lemma 4.1 in [32]).

Lemma 22. Let {Xn)nei be (j)-mixing data. There exists a function v 
satisfying, for all p, q in N, 
(22) 

1/ = ^i/i, with Pug < Pg, P{uP) < pY,{1 + ly-^Pi and < c^q , 

l>0 l>0 
such that, for all t in L'^{fi), 

(23) var^^t(Xi)^ <AqP{vt^) . 

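We apply Lemma 22 and we obtain, when $C_\beta < \infty$,
$\forall K = 1, 3, \ldots, 2V-1$,

$$\mathrm{var}(P_{B_K} f - Pf) \le \frac{4}{|B_K|}P(\nu f^2)
   \le \frac{4}{|B_K|}\sqrt{P(\nu^2)\,P(f^4)}.$$

Moreover, when $X_{1:n}$ is $\phi$-mixing, with
$\sum_q \phi_q \le \Phi^2$, we apply Lemma 22 to $f - Pf$ and we get,
$\forall K = 1, 3, \ldots, 2V-1$,

$$\mathrm{var}(P_{B_K} f - Pf) \le
   \frac{4}{\mathrm{Card}\,B_K}P\big(\nu(f - Pf)^2\big)
   \le 8\Phi^2\,\frac{V}{n}\,\mathrm{var}_P f.$$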

Plugging these inequalities in (21) yields the result, since we have 



xr < V ^ ^"^"^ - ) < ^8e varp ( (P., - P)f) . 

C.2. Proof of Theorem 18. Let us denote by $\Omega(C)$ the same event
as in the proof of Theorem 9. We can apply Lemma 17 to the function
$f = (\gamma(\hat s_K) - \gamma(s_0))1_{\Omega(C)}$ conditionally to the
random variables $X_{B_{2K-1}}$ and to the partition
$(B_{2J-1})_{J \ne K, K'}$ of $I \setminus (B_{2K-1} \cup B_{2K'-1})$, with
cardinality $n/2 - |B_{2K-1}| - |B_{2K'-1}| \ge n(1/2 - 2/V - 2/n) \ge n/4$
since $V \ge 16$. Condition (CMarg) implies

$$\mathrm{var}_P\big((\gamma(\hat s_K) - \gamma(s_0)) \mid X_{B_{2K-1}}\big)
   \le \sigma_0^2\,\ell(\hat s_K, s_0)^2
   + \sum_{i=1}^{N}\sigma_i^2\,\ell(\hat s_K, s_0)^{2\alpha_i}.$$

We introduce 

Co = 8^/^<l>, and Vi = 1, . . . , N, Ci = {1 - m) (Co(ai)"' )i/(^-"') . 
Since $V \ge \ln(2\delta^{-2})$, there exists a set $\Omega_{good}$
satisfying $P\{\Omega_{good}\} \ge 1 - V\beta_q$ such that, with probability
larger than $1 - \delta^2$, conditionally to $X_{B_{2K-1}}$, the following
event holds


V 



< Cn 



N 



2ai 



i=l 



(PKK'-P)hi-^K)-l{So)) 



^(7oe{sK, So) + aiiisK, SoT' ^ . 



where r2j„t = ^good^^{c)- As the bound on the probability does not depend 
on X^jK-i' same bound holds unconditionally. We use repeatedly the 
classical inequality a°^h^~'^ < aa + {1 — a)b, we obtain that 



Coy — ( aoi{sK,So) + 

\ i=l 



ait[SK, So 



V N 



N 



i=l 



1 

1 — a,- 



Using a union bound, we get that the following event holds with probability
larger than $1 - V(V-1)\delta^2 - \epsilon - V\beta_q$:
$\forall K, K' = 1, \ldots, V$,



K,K' 



P) ilisK) - lisK')) 



V N 



N 



< Coao\- + - {i{sK,So) + i{sK',So)) + 2Y,Ci A'^'ai 



i=l 



1 



n A 

We conclude as in the proof of Theorem 9. 

APPENDIX D: TOOLS 

D.1. A coupling result. The aim of this section is to prove the
coupling result used in the proof of Proposition 1.

Lemma 23. Let $Y_{1:N}$ be independent random variables, let $x$ be a real
number and let

$$A = \mathrm{Card}\{i = 1, \ldots, N \text{ s.t. } Y_i > x\}.$$

Let $p \in (0, 1]$ such that, for all $i = 1, \ldots, N$,
$p \ge P\{Y_i > x\}$ and let $B$ be a random variable with a binomial law
$B(N, p)$. There exists a coupling $\tilde C = (\tilde A, \tilde B)$ such
that $\tilde A$ has the same distribution as $A$, $\tilde B$ has the same
distribution as $B$ and such that $\tilde A \le \tilde B$. In particular,
for all $y > 0$,

(24) $$P\{A > y\} \le P\{B > y\}.$$

Proof. Let $U_{1:N}$ be i.i.d. random variables with uniform distribution on
$[0, 1]$. Let us define, for all $i = 1, \ldots, N$,

$$\tilde A_i = 1_{U_i \le P\{Y_i > x\}}, \quad \tilde B_i = 1_{U_i \le p}, \quad
  \tilde A = \sum_{i=1}^{N}\tilde A_i, \quad
  \tilde B = \sum_{i=1}^{N}\tilde B_i.$$

By construction, for all $i = 1, \ldots, N$, $\tilde A_i \le \tilde B_i$,
hence $\tilde A \le \tilde B$. Moreover, $\tilde B$ is the sum of independent
random variables with common Bernoulli distribution of parameter $p$; it has
therefore the same distribution as $B$. We also have that
$(\tilde A_i)_{i=1,\ldots,N}$ has the same distribution as
$(1_{Y_i > x})_{i=1,\ldots,N}$ since the marginals have the same
distributions and the coordinates are independent. Since
$A = \sum_{i=1}^{N} 1_{Y_i > x}$, $A$ and $\tilde A$ have the same
distribution.

In order to prove (24), we just say that $\tilde A \le \tilde B$ implies that
$\{\tilde A > y\} \subset \{\tilde B > y\}$, hence

$$P\{A > y\} = P\{\tilde A > y\} \le P\{\tilde B > y\} = P\{B > y\}.$$

□
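The coupling of this proof is straightforward to simulate. In the sketch below, `success_probs[i]` stands for $P\{Y_i > x\}$ and is treated as a known input, which is an assumption made for the illustration; the function name is ours.

```python
import random

def coupled_counts(success_probs, p, rng):
    """Draw one realisation of the coupling (A~, B~) from the proof:
    A~_i = 1{U_i <= P(Y_i > x)} and B~_i = 1{U_i <= p}, with the same
    uniform variable U_i used for both indicators."""
    if any(q > p for q in success_probs):
        raise ValueError("p must dominate every success probability")
    a_tilde = 0
    b_tilde = 0
    for q in success_probs:
        u = rng.random()          # U_i, uniform on [0, 1)
        a_tilde += int(u <= q)    # indicator A~_i
        b_tilde += int(u <= p)    # indicator B~_i, at least as large as A~_i
    return a_tilde, b_tilde

rng = random.Random(0)
draws = [coupled_counts([0.1, 0.3, 0.25, 0.05], 0.3, rng) for _ in range(1000)]
```

Every draw satisfies $\tilde A \le \tilde B$ by construction, which is the pathwise domination behind inequality (24).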

D.2. Concentration inequalities for Estimator Selection. 

Lemma 24. Let {Sm)meM be a collection of linear spaces of measurable 
functions and for all m G A4, let (V'A)AgA^ be an orthonormal basis of 
Sm. and = SasA V'a- IT be a probability measure on 7W. For all 
6 G (0,1), we denote by 



n 



Let Xi-n be i.i.d, ^-valued, random variables with common density G 
L'^{ji). For all m G M, let Sm be the orthogonal projection of S-k onto Sm 
and assume that (CSED) holds. Let us denote, for all 6 G (0, 1), 



ymGM, r/oo,2(lV||s||)^/||s^-s^f + ^^^ln( ) < em{S) . 

n \ d7r[m) 
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Then, with probability 1 — 5, the following event holds. Mm G M., 



\{Pn - P){Sm.„ - Sm)\ < -l^£m{S) \\Sm " + 



\ n 
E [{Pn - P)i^xf < {1 + L,u)^ + + ^rliS) . 

AGAm 

Proof. Let Zm = Z]AeA,„ K-^" ~ P)'^\?'- We proved the following con- 
centration inequality for Zm in [23]. Let = ^'^Vt£M(ni) -f^^]? fo'^ 
some absolute Lq < 16(ln2)~^ + 8, we have, for all 5 > 0, > 0, with 
probability larger than 1 — 5/2, 

P^m . . ( P^m ^ 1 ( vlH'^/5) , \\^m\\^{H2/5)f 
S -t^O V 1 1 



n \ n u \ n v'^n^ 



We have 

,.2 



<< Vll^mlloc sup P[\t\\ < x/ll^-r 
ieB{m) 



Hence, if we denote by 



7r(m)(5 



n 

we obtain, 

Vm G Al, > (1 + Loi^)^ + ^^^Tm{5) + ^rl{5) 1 < ^ . 

Moreover, Bernstein's inequality yields, for all m £ A^, for all nio, and for 
all 5 S (0, 1), with probability larger than 1 — 5/2, 



{Pn-P){Sm^-Sm)\ < 



2vaip{sm - Smo)ln{i/5) ^ - Sm J I qq ln(4/(5) 
n 3n 



Using Cauchy-Schwarz inequality and assumption (CSED), we deduce that, 
if nio is a minimizer of ||sm, — s^||^ + P^rn/n-, 

Varp(Sm ^ruo ) — W^m. ■'^m' || oo P\^m I 

< ??oo,2^'(^m + ^m,J ||s|| || 

^« II II Al l|2^^^"'' 

< 8nr]oo,2\\s\\ [ \\Sm - Si,\\ H 

\ n 
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Actually, we have ||s* - s^JI^ < ||s* - Smo\? + P^mo/n < - + 
P'^m/n, hence, by the triangular inequality. 



<2\\Sm-sJ\ +2 S^-SmJ < 4 - + 



n 



Moreover, P^^^ < n\\s^ - Smolf + P^Pmo < n\\sm - s^^lf + P'^m, hence, 

,,2 -P^" 

+ ^-^J < 2n \\sm - s^lr + 



n 

We also have 

- ■SmJlooln(4/((57r(m))) 
3n 

/ , hrioo,2P{^m + ^m j ||Sm " Sm^ f ln(4/((57r(m))) 

- V ^ 

Voo,2P{'^m + ^mo) ln(4/(,5^(m))) 
18n 

2^2 , / 4 \ /„ „2 P^-r ^ 

3 \ OTT[m) J \ n 

Using a union bound, we deduce that, with probability larger than 1 — 6/2, 
Mm G M, 

\{Pn - P){Smo - Sm)\ < ^^em(5) ( \\Sm - S^\f + ^^'^ 



3 1 1 ii \ / 1 \\ 1 1 1/ '^11 
\ n 

Let us then denote by Li(e) = 3e^^Lo + WV2/3 and by □ 
D.3. Control of the infimum. 

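Lemma 25. Let $X_{1:n}$ be i.i.d. random variables and let $\delta > 0$ be such that
$\lceil \ln(\delta^{-2}) \rceil \le n/2$. Let $\mathcal{B}$ be a regular partition of $\{1, \dots, n\}$, with
$V = \lceil \ln(\delta^{-2}) \rceil \vee 8$. Let $(\hat s_K)$ be a sequence of estimators such that, for all $K$,
$\hat s_K = f(X_{B_K})$. Let $\alpha \in (0, 1)$ and let $x_\alpha$ be a real number satisfying the
following property:

$$\mathrm{Card}\left\{K = 1, \dots, V, \; P\big(\ell(\hat s_K, s_o) > x_\alpha\big) \le e^{-\frac{1}{2\alpha}}\right\} \ge \alpha V .$$

Then we have

$$P\Big(\inf_{K=1,\dots,V} \ell(\hat s_K, s_o) > x_\alpha\Big) \le \delta .$$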






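Proof. By independence of the $X_{B_K}$,

$$P\Big(\inf_{K=1,\dots,V} \ell(\hat s_K, s_o) > x_\alpha\Big) = \prod_{K=1}^{V} P\big(\ell(\hat s_K, s_o) > x_\alpha\big) \le \left(e^{-\frac{1}{2\alpha}}\right)^{\alpha V} \le \delta .$$

□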
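The block construction and the product bound used in this proof can be sketched numerically. The snippet below is only an illustration under stated assumptions: a regular partition is taken to mean consecutive blocks whose sizes differ by at most one, the number of blocks is read as $V = \lceil \ln(\delta^{-2}) \rceil \vee 8$, and the per-block probability level is assumed to be $e^{-1/(2\alpha)}$. The helper names `num_blocks`, `regular_partition` and `product_bound` are hypothetical.

```python
import math
from typing import List


def num_blocks(delta: float) -> int:
    """V = max(ceil(ln(delta**-2)), 8), computed as ceil(-2 * ln(delta))."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return max(math.ceil(-2.0 * math.log(delta)), 8)


def regular_partition(n: int, v: int) -> List[List[int]]:
    """Split {1, ..., n} into v consecutive blocks with sizes differing by at most one."""
    if not 1 <= v <= n:
        raise ValueError("need 1 <= v <= n")
    base, extra = divmod(n, v)
    blocks, start = [], 1
    for k in range(v):
        size = base + (1 if k < extra else 0)
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks


def product_bound(q: float, alpha: float, v: int) -> float:
    """Upper bound on the probability that v independent events all occur,
    when at least alpha * v of them have probability at most q (the remaining
    factors are bounded by 1)."""
    if not 0.0 <= q <= 1.0 or not 0.0 < alpha <= 1.0:
        raise ValueError("need 0 <= q <= 1 and 0 < alpha <= 1")
    return q ** math.ceil(alpha * v)


if __name__ == "__main__":
    delta, alpha = 1e-3, 0.5
    v = num_blocks(delta)
    q = math.exp(-1.0 / (2.0 * alpha))
    print(v, product_bound(q, alpha, v), delta)
```

With $q = e^{-1/(2\alpha)}$ the bound is at most $e^{-V/2}$, which is below $\delta$ as soon as $V \ge \ln(\delta^{-2})$; this is the only place where the choice of $V$ enters.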

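An elementary application of the previous result is the following lemma.

Lemma 26. Let $X_{1:n}$ be i.i.d. random variables and let $\delta > 0$ be such that
$\lceil \ln(\delta^{-2}) \rceil \le n/2$. Let $\mathcal{B}$ be a regular partition of $\{1, \dots, n\}$, with
$V = \lceil \ln(\delta^{-2}) \rceil \vee 8$. Let $(\hat s_K)$ be a sequence of estimators such that, for all $K$,
$\hat s_K = f(X_{B_K})$. Let $\alpha \in (0, 1)$ and let $E_\alpha > 0$ be such that

$$\mathrm{Card}\left\{K = 1, \dots, V, \; \mathbb{E}\big(\ell(\hat s_K, s_o)\big) \le E_\alpha\right\} \ge \alpha V .$$

Then, we have

$$P\Big(\inf_{K=1,\dots,V} \ell(\hat s_K, s_o) \le e^{\frac{1}{2\alpha}} E_\alpha\Big) \ge 1 - \delta .$$

Proof. From Markov's inequality,

$$P\Big(\ell(\hat s_K, s_o) > e^{\frac{1}{2\alpha}}\, \mathbb{E}\big(\ell(\hat s_K, s_o)\big)\Big) \le e^{-\frac{1}{2\alpha}} .$$

Hence, the result follows from Lemma 25, applied with $x_\alpha = e^{\frac{1}{2\alpha}} E_\alpha$, and the assumption on $E_\alpha$. □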

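Lemma 27. Let $X_{1:n}$ be i.i.d. random variables with common measure
$P$. Let $\mathcal{B}$ be a regular partition of $\{1, \dots, n\}$ with cardinality $V$ larger than
$\lceil \ln(\delta^{-2}) \rceil$. Let $r = (1 - \sqrt{1 - e^{-2}})/2$ and assume that there exists a collection
of independent events $(\Omega_K)_{K=1,\dots,V}$ such that

$$\forall K = 1, \dots, V, \quad \mathbb{E}\big(\ell(\hat s_K, s_o)\mathbf{1}_{\Omega_K}\big) \le E, \quad P\{\Omega_K^c\} \le r .$$

Then, we have

$$P\Big(\inf_{K=1,\dots,V} \ell(\hat s_K, s_o) > 4e^2 E\Big) \le 2e^{-V} .$$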

Proof. We introduce the random variables $Y_K = \mathbf{1}_{\Omega_K^c}$. We apply Lemma
23 with $x = 1/2$ and $p = r$. Denoting by $B(V, r)$ a random variable with
binomial distribution of parameters $V$ and $r$, we obtain

$$P\Big(\mathrm{Card}\{K = 1, \dots, V, \text{ s.t. } \Omega_K^c \text{ holds}\} \ge \frac{V}{2}\Big) \le P\Big(B(V, r) \ge \frac{V}{2}\Big) .$$
Let us denote by $\Omega = \{K = 1, \dots, V, \text{ s.t. } \Omega_K \text{ holds}\}$ and $N = \mathrm{Card}\,\Omega$. By
inequality (2.8) in [26],

$$P\Big(B(V, r) \ge \frac{V}{2}\Big) \le e^{\frac{V}{2}\ln(4r(1-r))} = e^{-V} .$$

As a consequence,

$$(25) \qquad P\Big(N < \frac{V}{2}\Big) \le e^{-V} .$$
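Moreover, for all $x > 4$, we have

$$\begin{aligned}
P\Big(\inf_{K=1,\dots,V} \ell(\hat s_K, s_o) > xE \,\cap\, N \ge \frac{V}{2}\Big)
&= \sum_{A \subset \{1,\dots,V\},\ \mathrm{Card} A \ge V/2} P\Big(\inf_{K=1,\dots,V} \ell(\hat s_K, s_o) > xE \,\cap\, \Omega = A\Big) \\
&\le \sum_{A \subset \{1,\dots,V\},\ \mathrm{Card} A \ge V/2} P\Big(\inf_{K \in A} \ell(\hat s_K, s_o) > xE \,\cap\, \Omega = A\Big) \\
&\le \sum_{A \subset \{1,\dots,V\},\ \mathrm{Card} A \ge V/2} \ \prod_{K \in A} P\big(\{\ell(\hat s_K, s_o) > xE\} \cap \Omega_K\big) \\
&\le \sum_{A \subset \{1,\dots,V\},\ \mathrm{Card} A \ge V/2} \left(\frac{1}{x}\right)^{\mathrm{Card} A} .
\end{aligned}$$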



We conclude the proof with the fact that, if $x = 4e^2$, $(1/x)^{\mathrm{Card} A} \le x^{-V/2} = (2e)^{-V}$,
and the fact that there are fewer than $2^V$ terms in the sum. □
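The constants in this proof can be checked numerically. The sketch below verifies that $r = (1 - \sqrt{1 - e^{-2}})/2$ satisfies $4r(1-r) = e^{-2}$, and that the exact upper tail $P(B(V, r) \ge V/2)$ stays below $e^{-V}$ for small $V$. It is a sanity check under our reading of the constants, not part of the original argument; the function names `binomial_upper_tail` and `check_constants` are hypothetical.

```python
import math

# Value of r from the statement of the lemma.
R = (1.0 - math.sqrt(1.0 - math.exp(-2.0))) / 2.0


def binomial_upper_tail(v: int, p: float, threshold: float) -> float:
    """Exact P(B >= threshold) for B ~ Binomial(v, p)."""
    k_min = max(0, math.ceil(threshold))
    return sum(
        math.comb(v, k) * p ** k * (1.0 - p) ** (v - k)
        for k in range(k_min, v + 1)
    )


def check_constants(v_max: int = 40) -> bool:
    """Check 4 r (1 - r) = e^{-2} and P(B(V, r) >= V/2) <= e^{-V} for V <= v_max."""
    identity_ok = math.isclose(4.0 * R * (1.0 - R), math.exp(-2.0), rel_tol=1e-12)
    tails_ok = all(
        binomial_upper_tail(v, R, v / 2.0) <= math.exp(-v)
        for v in range(1, v_max + 1)
    )
    return identity_ok and tails_ok


if __name__ == "__main__":
    print(R, check_constants())
```

The identity $4r(1-r) = e^{-2}$ is what turns the Chernoff-type exponent $\frac{V}{2}\ln(4r(1-r))$ into exactly $-V$.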